Statistical Applications in Genetics and Molecular Biology: Latest Articles

Estimating intrinsic and extrinsic noise from single-cell gene expression measurements.
IF 0.9 | Zone 4 (Mathematics) | Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY | Pub Date: 2016-12-01 | DOI: 10.1515/sagmb-2016-0002
Audrey Qiuyan Fu, Lior Pachter

Gene expression is stochastic and displays variation ("noise") both within and between cells. Intracellular (intrinsic) variance can be distinguished from extracellular (extrinsic) variance by applying the law of total variance to data from two-reporter assays that probe expression of identically regulated gene pairs in single cells. We examine established formulas [Elowitz, M. B., A. J. Levine, E. D. Siggia and P. S. Swain (2002): "Stochastic gene expression in a single cell," Science, 297, 1183-1186.] for the estimation of intrinsic and extrinsic noise and provide interpretations of them in terms of a hierarchical model. This allows us to derive alternative estimators that minimize bias or mean squared error. We provide a geometric interpretation of these results that clarifies the interpretation in [Elowitz, M. B., A. J. Levine, E. D. Siggia and P. S. Swain (2002): "Stochastic gene expression in a single cell," Science, 297, 1183-1186.]. We also demonstrate through simulation and re-analysis of published data that the distribution assumptions underlying the hierarchical model have to be satisfied for the estimators to produce sensible results, which highlights the importance of normalization.

Statistical Applications in Genetics and Molecular Biology, vol. 15, no. 6, pp. 447-471.
Citations: 23
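
As a rough illustration of the established two-reporter formulas that the abstract above examines (Elowitz et al., 2002), the following Python sketch computes intrinsic, extrinsic and total squared noise from paired reporter intensities measured in the same cells. The function name `elss_noise` and the toy intensities are hypothetical. The code uses the original plug-in estimators as they are commonly stated, not the bias- or MSE-minimizing alternatives that Fu and Pachter derive.

```python
from statistics import fmean


def elss_noise(c, y):
    """Plug-in estimates of squared intrinsic, extrinsic and total noise.

    c, y: equal-length lists of the two reporter intensities, one pair per cell.
    Formulas (as commonly stated for the two-reporter assay):
      intrinsic = <(c - y)^2> / (2 <c><y>)
      extrinsic = (<c*y> - <c><y>) / (<c><y>)
      total     = (<c^2 + y^2> - 2 <c><y>) / (2 <c><y>)
    """
    if len(c) != len(y) or len(c) == 0:
        raise ValueError("c and y must be non-empty and of equal length")
    denom = fmean(c) * fmean(y)
    intrinsic = fmean([(ci - yi) ** 2 for ci, yi in zip(c, y)]) / (2 * denom)
    extrinsic = (fmean([ci * yi for ci, yi in zip(c, y)]) - denom) / denom
    total = (fmean([ci * ci + yi * yi for ci, yi in zip(c, y)]) - 2 * denom) / (2 * denom)
    return intrinsic, extrinsic, total


# Toy data (hypothetical): four cells, two reporters per cell.
cfp = [10.0, 12.0, 8.0, 11.0]
yfp = [9.0, 13.0, 7.0, 12.0]
eta_int, eta_ext, eta_tot = elss_noise(cfp, yfp)
print(eta_int, eta_ext, eta_tot)
```

With these definitions the total equals the sum of the intrinsic and extrinsic terms by algebra, which mirrors the law-of-total-variance decomposition mentioned in the abstract. The abstract also cautions that the estimators give sensible results only when the model's distribution assumptions hold, so the reporter values would normally be normalized first.
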
Search of latent periodicity in amino acid sequences by means of genetic algorithm and dynamic programming.
IF 0.9 | Zone 4 (Mathematics) | Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY | Pub Date: 2016-10-01 | DOI: 10.1515/sagmb-2015-0079
Valentina Pugacheva, Alexander Korotkov, Eugene Korotkov

The aim of this study was to show that amino acid sequences have a latent periodicity with insertions and deletions of amino acids in unknown positions of the analyzed sequence. A genetic algorithm, dynamic programming and random weight matrices were used to develop a new mathematical algorithm for latent periodicity search. A multiple alignment of periods was calculated with the help of direct optimization of the position-weight matrix, without using pairwise alignments. The developed algorithm was applied to analyze amino acid sequences of a small number of proteins. This study showed the presence of latent periodicity with insertions and deletions in the amino acid sequences of proteins for which latent periodicity was not previously known. The origin of latent periodicity with insertions and deletions is discussed.

Statistical Applications in Genetics and Molecular Biology, vol. 15, no. 5, pp. 381-400.
Citations: 26
Evaluation of low-template DNA profiles using peak heights.
IF 0.9 | Zone 4 (Mathematics) | Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY | Pub Date: 2016-10-01 | DOI: 10.1515/sagmb-2016-0038
Christopher D Steele, Matthew Greenhalgh, David J Balding

In recent years statistical models for the analysis of complex (low-template and/or mixed) DNA profiles have moved from using only presence/absence information about allelic peaks in an electropherogram to quantitative use of peak heights. This is challenging because peak heights are highly variable and affected by a number of factors. We present a new peak-height model with important novel features, including over- and double-stutter, and a new approach to drop-in. Our model is incorporated in open-source R code likeLTD. We apply it to 108 laboratory-generated crime-scene profiles and demonstrate techniques of model validation that are novel in the field. We use the results to explore the benefits of modeling peak heights, finding that it is not always advantageous, and to assess the merits of pre-extraction replication. We also introduce an approximation that can reduce computational complexity when there are multiple low-level contributors who are not of interest to the investigation, and we present a simple approximate adjustment for linkage between loci, making it possible to accommodate linkage when evaluating complex DNA profiles.

Statistical Applications in Genetics and Molecular Biology, vol. 15, no. 5, pp. 431-445.
Citations: 28
The use of vector bootstrapping to improve variable selection precision in Lasso models.
IF 0.9 | Zone 4 (Mathematics) | Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY | Pub Date: 2016-08-01 | DOI: 10.1515/sagmb-2015-0043
Charles Laurin, Dorret Boomsma, Gitta Lubke

The Lasso is a shrinkage regression method that is widely used for variable selection in statistical genetics. Commonly, K-fold cross-validation is used to fit a Lasso model. This is sometimes followed by using bootstrap confidence intervals to improve precision in the resulting variable selections. Nesting cross-validation within bootstrapping could provide further improvements in precision, but this has not been investigated systematically. We performed simulation studies of Lasso variable selection precision (VSP) with and without nesting cross-validation within bootstrapping. Data were simulated to represent genomic data under a polygenic model as well as under a model with effect sizes representative of typical GWAS results. We compared these approaches to each other as well as to software defaults for the Lasso. Nested cross-validation had the most precise variable selection at small effect sizes. At larger effect sizes, there was no advantage to nesting. We illustrated the nested approach with empirical data comprising SNPs and SNP-SNP interactions from the most significant SNPs in a GWAS of borderline personality symptoms. In the empirical example, we found that the default Lasso selected low-reliability SNPs and interactions which were excluded by bootstrapping.

Statistical Applications in Genetics and Molecular Biology, vol. 15, no. 4, pp. 305-20.
Citations: 22
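
The idea of bootstrapping Lasso variable selections, described in the abstract above, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: "vector bootstrapping" is read here as resampling whole observation rows, the Lasso is fitted by a simple cyclic coordinate descent without an intercept, and the penalty `lam` is held fixed. In the nested variant studied in the paper, K-fold cross-validation would instead re-choose the penalty inside every resample. All function names and the simulated data are hypothetical.

```python
import random


def soft_threshold(z, g):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0


def lasso_cd(X, y, lam, sweeps=50):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, kept up to date
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the residual that excludes column j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def bootstrap_selection_freq(X, y, lam, n_boot=50, seed=0):
    """Fraction of row-resampled data sets in which each coefficient is non-zero."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j, b in enumerate(beta):
            if b != 0.0:
                counts[j] += 1
    return [c / n_boot for c in counts]


# Simulated data (hypothetical): only the first of five predictors has an effect.
sim = random.Random(1)
X = [[sim.gauss(0.0, 1.0) for _ in range(5)] for _ in range(60)]
y = [2.0 * row[0] + sim.gauss(0.0, 0.5) for row in X]
freq = bootstrap_selection_freq(X, y, lam=0.4)
print(freq)
```

Predictors that are selected in only a small share of resamples are the kind of low-reliability selections that, according to the abstract, bootstrapping helps to exclude.
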
Bayesian state space models for dynamic genetic network construction across multiple tissues.
IF 0.9 | Zone 4 (Mathematics) | Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY | Pub Date: 2016-08-01 | DOI: 10.1515/sagmb-2014-0055
Yulan Liang, Arpad Kelemen

Construction of gene-gene interaction networks and potential pathways is a challenging and important problem in genomic research for complex diseases, and estimating the dynamic changes of the temporal correlations and the non-stationarity is key to this process. In this paper, we develop dynamic state space models with hierarchical Bayesian settings to tackle this challenge and infer the dynamic profiles and genetic networks associated with disease treatments. We treat both the stochastic transition matrix and the observation matrix as time-variant and include temporal correlation structures in the covariance matrix estimations in the multivariate Bayesian state space models. The unevenly spaced short time courses with unseen time points are treated as hidden state variables. Hierarchical Bayesian approaches with various prior and hyper-prior models, together with Markov chain Monte Carlo and Gibbs sampling algorithms, are used to estimate the model parameters and the hidden state variables. We apply the proposed hierarchical Bayesian state space models to Affymetrix time course data sets from multiple tissues (liver, skeletal muscle, and kidney) following corticosteroid (CS) drug administration. Both simulation and real data analysis results show that the genomic changes over time and the gene-gene interactions in response to CS treatment can be well captured by the proposed models. The proposed dynamic hierarchical Bayesian state space modeling approaches could be expanded and applied to other large-scale genomic data, such as next-generation sequencing (NGS) data combined with real-time and time-varying electronic health records (EHR), for more comprehensive and robust systematic and network-based analysis. The aim would be to transform big biomedical data into predictions and diagnostics for precision medicine and personalized healthcare, with better decision making and patient outcomes.

Statistical Applications in Genetics and Molecular Biology, vol. 15, no. 4, pp. 273-90.
Citations: 4
LandScape: a simple method to aggregate p-values and other stochastic variables without a priori grouping.
IF 0.9 | Zone 4 (Mathematics) | Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY | Pub Date: 2016-08-01 | DOI: 10.1515/sagmb-2015-0085
Carsten Wiuf, Jonatan Schaumburg-Müller Pallesen, Leslie Foldager, Jakob Grove

In many areas of science it is customary to perform many tests, potentially millions, simultaneously. To gain statistical power it is common to group tests based on a priori criteria such as predefined regions or by sliding windows. However, it is not straightforward to choose grouping criteria and the results might depend on the chosen criteria. Methods that summarize, or aggregate, test statistics or p-values, without relying on a priori criteria, are therefore desirable. We present a simple method to aggregate a sequence of stochastic variables, such as test statistics or p-values, into fewer variables without assuming a priori defined groups. We provide different ways to evaluate the significance of the aggregated variables based on theoretical considerations and resampling techniques, and show that under certain assumptions the family-wise error rate (FWER) is controlled in the strong sense. Validity of the method was demonstrated using simulations and real data analyses. Our method may be a useful supplement to standard procedures that rely on evaluating test statistics individually. Moreover, by being agnostic and not relying on predefined selected regions, it might be a practical alternative to conventionally used methods of aggregation of p-values over regions. The method is implemented in Python and freely available online (through GitHub, see the Supplementary information).

Statistical Applications in Genetics and Molecular Biology, vol. 15, no. 4, pp. 349-61.
Citations: 2
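
For context, the conventional window-based aggregation that the abstract above contrasts LandScape with can be sketched in a few lines of Python. This is not the LandScape method itself, whose details are not given in the abstract. It is an assumed baseline that combines the p-values in each fixed-width sliding window with Fisher's method, which presumes independent tests. The function names, the window width and the example p-values are hypothetical.

```python
import math


def fisher_combine(pvalues):
    """Fisher's combined p-value for independent p-values in (0, 1].

    The statistic -2 * sum(log p) is chi-square with 2k degrees of freedom
    under the null; for even degrees of freedom the upper tail has the closed
    form exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one p-value")
    half = -sum(math.log(p) for p in pvalues)  # equals statistic / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def sliding_window_pvalues(pvalues, width):
    """Combine each run of `width` consecutive p-values with Fisher's method."""
    return [
        fisher_combine(pvalues[start:start + width])
        for start in range(len(pvalues) - width + 1)
    ]


# Hypothetical sequence of per-test p-values with a cluster of small values.
pvals = [0.8, 0.6, 0.04, 0.03, 0.02, 0.7, 0.9]
print(sliding_window_pvalues(pvals, width=3))
```

The window width here is exactly the kind of a priori choice the abstract describes as hard to make, and the result depends on it. That dependence is the stated motivation for a method that does not rely on predefined groups.
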
A joint modeling approach for uncovering associations between gene expression, bioactivity and chemical structure in early drug discovery to guide lead selection and genomic biomarker development.
IF 0.9 | Zone 4 (Mathematics) | Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY | Pub Date: 2016-08-01 | DOI: 10.1515/sagmb-2014-0086
Nolen Perualila-Tan, Adetayo Kasim, Willem Talloen, Bie Verbist, Hinrich W H Göhlmann, Ziv Shkedy

The modern drug discovery process involves multiple sources of high-dimensional data. This imposes the challenge of data integration. A typical example is the integration of chemical structure (fingerprint features), phenotypic bioactivity (bioassay read-outs) data for targets of interest, and transcriptomic (gene expression) data in early drug discovery to better understand the chemical and biological mechanisms of candidate drugs, and to facilitate early detection of safety issues before the later, more expensive phases of drug development cycles. In this paper, we discuss a joint model for the transcriptomic and the phenotypic variables conditioned on the chemical structure. This modeling approach can be used to uncover, for a given set of compounds, the association between gene expression and biological activity, taking into account the influence of the chemical structure of the compound on both variables. The model allows the detection of genes that are associated with the bioactivity data, facilitating the identification of potential genomic biomarkers for compound efficacy. In addition, the effect of every structural feature on both genes and pIC50, and on their associations, can be investigated simultaneously. Two oncology projects are used to illustrate the applicability and usefulness of the joint model for integrating multi-source high-dimensional information to aid drug discovery.

Statistical Applications in Genetics and Molecular Biology, vol. 15, no. 4, pp. 291-304.
Citations: 4
Finding causative genes from high-dimensional data: an appraisal of statistical and machine learning approaches.
IF 0.9 | Zone 4 (Mathematics) | Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY | Pub Date: 2016-08-01 | DOI: 10.1515/sagmb-2015-0072
Chamont Wang, Jana L Gevertz

Modern biological experiments often involve high-dimensional data with thousands or more variables. A challenging problem is to identify the key variables that are related to a specific disease. Confounding this task is the vast number of statistical methods available for variable selection. For this reason, we set out to develop a framework to investigate the variable selection capability of statistical methods that are commonly applied to analyze high-dimensional biological datasets. Specifically, we designed six simulated cancers (based on benchmark colon and prostate cancer data) where we know precisely which genes cause a dataset to be classified as cancerous or normal - we call these causative genes. We found that not one statistical method tested could identify all the causative genes for all of the simulated cancers, even though increasing the sample size does improve the variable selection capabilities in most cases. Furthermore, certain statistical tools can classify our simulated data with a low error rate, yet the variables being used for classification are not necessarily the causative genes.

Citations: 1
Sparse factor model for co-expression networks with an application using prior biological knowledge.
IF 0.9 Zone 4 Mathematics Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY Pub Date: 2016-06-01 DOI: 10.1515/sagmb-2015-0002
Yuna Blum, Magalie Houée-Bigot, David Causeur
Inference on gene regulatory networks from high-throughput expression data is one of the main current challenges in systems biology. Such networks can be very insightful for a deep understanding of interactions between genes. Because gene-gene interactions are often viewed as joint contributions to known biological mechanisms, inference on the dependence among gene expressions is expected to be consistent to some extent with the functional characterization of genes that can be derived from ontologies (GO, KEGG, …). The present paper introduces a sparse factor model as a general framework either to account for prior knowledge on joint contributions of modules of genes to latent biological processes or to infer the corresponding co-expression network. We propose an ℓ1-regularized EM algorithm to fit a sparse factor model for correlation. We demonstrate how it helps extract modules of genes and more generally improves gene clustering performance. The method is compared to alternative estimation procedures for sparse factor models of relevance networks in a simulation study. The integration of biological knowledge based on the Gene Ontology (GO) is also illustrated on liver expression data generated to understand adiposity variability in chicken.
Citations: 6
Bayesian mixed-effects model for the analysis of a series of FRAP images.
IF 0.9 Zone 4 Mathematics Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY Pub Date: 2015-02-01 DOI: 10.1515/sagmb-2014-0013
Martina Feilke, Katrin Schneider, Volker J Schmid

The binding behavior of molecules in nuclei of living cells can be studied through the analysis of images from fluorescence recovery after photobleaching (FRAP) experiments. However, there is still a lack of methodology for the statistical evaluation of FRAP data, especially for the joint analysis of multiple dynamic images. We propose a hierarchical Bayesian nonlinear model with mixed-effect priors based on local compartment models in order to obtain joint parameter estimates for all nuclei as well as to account for the heterogeneity of the population of nuclei. We apply our method to a series of FRAP experiments of DNA methyltransferase 1 tagged to green fluorescent protein expressed in a somatic mouse cell line and compare the results to those of three different fixed-effects models applied to the same series of FRAP experiments. With the proposed model, we obtain estimates of the off-rates of the interactions of the molecules under study together with credible intervals, and additionally gain information about the variability between nuclei. The proposed model is superior to and more robust than the tested fixed-effects models. Therefore, it can be used for the joint analysis of data from FRAP experiments on various similar nuclei.

Citations: 2