We propose an approach for multiple sequence alignment (MSA) derived from the dynamic time warping viewpoint and recent techniques of curve synchronization developed in the context of functional data analysis. Starting from pairwise alignments of all the sequences (viewed as paths in a certain space), we construct a median path that represents the MSA we are looking for. We establish a proof of concept that our method could be an interesting ingredient to include in refined MSA techniques. We present a simple synthetic experiment as well as a study of a benchmark dataset, together with comparisons with two widely used MSA software packages.
"A time warping approach to multiple sequence alignment." Ana Arribas-Gil, Catherine Matias. Statistical Applications in Genetics and Molecular Biology 16(2): 133-144, 25 April 2017. doi:10.1515/sagmb-2016-0043
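The pairwise alignments that the method starts from can be pictured with a generic dynamic-time-warping recursion. The sketch below is a textbook DTW implementation with a 0/1 mismatch cost (the cost function and the backtracking rule are illustrative assumptions, not the authors' implementation); the recovered warping path is the "path in a certain space" that the abstract refers to.

```python
# Generic dynamic time warping between two sequences, returning the
# alignment cost and the warping path. The 0/1 mismatch cost is an
# assumption for illustration; the paper works with curve-synchronization
# machinery on top of such pairwise paths.
def dtw_path(a, b, cost=lambda x, y: 0 if x == y else 1):
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(a[i - 1], b[j - 1])
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Backtrack from (n, m) to (0, 0) to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
        if step == D[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif step == D[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return D[n][m], path[::-1]
```

For identical sequences the cost is zero and the path is the diagonal; repeated elements are absorbed by horizontal or vertical steps instead of incurring gap penalties.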
Abstract RNA sequencing (RNA-seq) is widely used to study gene expression changes associated with treatments or biological conditions. Many popular methods for detecting differential expression (DE) from RNA-seq data use generalized linear models (GLMs) fitted to the read counts across independent replicate samples for each gene. This article shows that the standard formula for the residual degrees of freedom (d.f.) in a linear model is overstated when the model contains fitted values that are exactly zero. Such fitted values occur whenever all the counts in a treatment group are zero as well as in more complex models such as those involving paired comparisons. This misspecification results in underestimation of the genewise variances and loss of type I error control. This article proposes a formula for the reduced residual d.f. that restores error control in simulated RNA-seq data and improves detection of DE genes in a real data analysis. The new approach is implemented in the quasi-likelihood framework of the edgeR software package. The results of this article also apply to RNA-seq analyses that apply linear models to log-transformed counts, such as those in the limma software package, and more generally to any count-based GLM where exactly zero fitted values are possible.
"No counts, no variance: allowing for loss of degrees of freedom when assessing biological variability from RNA-seq data." Aaron T. L. Lun, Gordon K. Smyth. Statistical Applications in Genetics and Molecular Biology 16(2): 83-93, 25 April 2017. doi:10.1515/sagmb-2017-0010
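The degrees-of-freedom problem is easy to see in a toy calculation. Below, one treatment group has all-zero counts, so its fitted group mean is exactly zero and its residuals are identically zero; the classical residual d.f. formula still counts those observations. The adjustment shown (discounting the zero-fitted group's apparent residual d.f.) is a simplified stand-in for the edgeR formula, not the package's actual code.

```python
# One gene, two groups of 3 replicates; group 0 has no counts at all.
counts = [0, 0, 0, 5, 7, 6]
groups = [0, 0, 0, 1, 1, 1]

# Fit group means (the saturated one-way layout for this gene).
means = {g: sum(c for c, gg in zip(counts, groups) if gg == g) / 3 for g in (0, 1)}
residuals = [c - means[g] for c, g in zip(counts, groups)]

n, p = len(counts), 2
naive_df = n - p                      # classical formula: 6 - 2 = 4
# The three zero observations have exactly-zero fitted values and
# identically-zero residuals, so they carry no information about the
# variance; discounting them leaves only the non-zero group's 3 - 1 = 2 d.f.
n_zero = sum(1 for g in groups if means[g] == 0)
adjusted_df = n - p - (n_zero - 1)    # 4 - 2 = 2
```

Using 4 d.f. instead of 2 here understates the genewise variance uncertainty, which is the loss of type I error control the abstract describes.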
Disease subtype identification (clustering) is an important problem in biomedical research. Gene expression profiles are commonly utilized to infer disease subtypes, which often lead to biologically meaningful insights into disease. Despite many successes, existing clustering methods may not perform well when genes are highly correlated and many uninformative genes are included for clustering due to the high dimensionality. In this article, we introduce a novel subtype identification method in the Bayesian setting based on gene expression profiles. This method, called BCSub, adopts an innovative semiparametric Bayesian factor analysis model to reduce the dimension of the data to a few factor scores for clustering. Specifically, the factor scores are assumed to follow the Dirichlet process mixture model in order to induce clustering. Through extensive simulation studies, we show that BCSub has improved performance over commonly used clustering methods. When applied to two gene expression datasets, our model is able to identify subtypes that are clinically more relevant than those identified from the existing methods.
"A Bayesian semiparametric factor analysis model for subtype identification." Jiehuan Sun, Joshua L. Warren, Hongyu Zhao. Statistical Applications in Genetics and Molecular Biology 16(2): 145-158, 25 April 2017. doi:10.1515/sagmb-2016-0051
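How a Dirichlet process prior induces clustering without fixing the number of clusters can be illustrated with its Chinese-restaurant-process representation: each new observation joins an existing cluster with probability proportional to the cluster's size, or opens a new one with probability proportional to the concentration parameter. This is a generic illustration of the DP mixture idea, not the BCSub model itself.

```python
import random

def crp_assignments(n, alpha, seed=0):
    # Chinese restaurant process: sequentially seat n customers.
    rng = random.Random(seed)
    counts = []                       # customers per table (cluster sizes)
    labels = []
    for _ in range(n):
        # Join table k with weight counts[k]; open a new table with weight alpha.
        weights = counts + [alpha]
        r = rng.uniform(0, sum(weights))
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(counts):
            counts.append(1)          # a new cluster is born
        else:
            counts[k] += 1
        labels.append(k)
    return labels, counts
```

In BCSub this prior sits on the factor scores, so the number of subtypes is inferred from the data rather than prespecified; larger `alpha` yields more, smaller clusters.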
High-dimensional data such as gene expression and RNA-seq data often contain missing values. Subsequent analyses and the results based on these incomplete data can suffer strongly from the presence of these missing values. Several approaches to the imputation of missing values in gene expression data have been developed, but the task is difficult due to the high dimensionality (number of genes) of the data. Here an imputation procedure is proposed that uses weighted nearest neighbors. Instead of using nearest neighbors defined by a distance that includes all genes, the distance is computed over only those genes that are apt to contribute to the accuracy of the imputed values. The method aims at avoiding the curse of dimensionality, which typically arises when local methods such as nearest neighbors are applied in high-dimensional settings. The proposed weighted nearest neighbors algorithm is compared to existing missing value imputation techniques such as mean imputation, KNNimpute and the recently proposed imputation by random forests. We use RNA-seq and microarray data from studies on human cancer to compare the performance of the methods. The results from simulations as well as real studies show that the weighted distance procedure can successfully handle missing values for high-dimensional data structures where the number of predictors is larger than the number of samples. The method typically outperforms the considered competitors.
"Missing value imputation for gene expression data by tailored nearest neighbors." Shahla Faisal, Gerhard Tutz. Statistical Applications in Genetics and Molecular Biology 16(2): 95-106, 25 April 2017. doi:10.1515/sagmb-2015-0098
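The basic machinery, weighted-average imputation over the k nearest genes, can be sketched as below. This is a generic weighted-kNN stand-in: the paper's contribution lies in selecting which genes enter the distance and how the weights decay, which this sketch does not reproduce.

```python
# Weighted nearest-neighbor imputation over gene rows; None marks a
# missing value. Distances use only columns observed in both rows,
# and imputed values are distance-weighted averages of the k nearest
# neighbor genes (weight function here is an illustrative choice).
def knn_impute(matrix, k=2):
    def dist(a, b):
        pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        return (sum((x - y) ** 2 for x, y in pairs) / len(pairs)) ** 0.5

    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, v in enumerate(row):
            if v is None:
                # Neighbor genes that have an observed value at column j.
                cands = [(dist(row, other), other[j])
                         for other in matrix
                         if other is not row and other[j] is not None]
                cands.sort(key=lambda t: t[0])
                nearest = cands[:k]
                weights = [1.0 / (d + 1e-8) for d, _ in nearest]
                filled[i][j] = (sum(w * val for w, (_, val) in zip(weights, nearest))
                                / sum(weights))
    return filled
```

With `p >> n` the key difficulty is that distances over all genes become uninformative; restricting the distance to informative genes, as the paper does, is what keeps the neighbors meaningful.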
A classical approach to experimental design in many scientific fields is to first gather all of the data and then analyze it in a single analysis. It has been recognized that in many areas such practice leaves substantial room for improvement in terms of the researcher's ability to identify relevant effects, in terms of cost efficiency, or both. Considerable attention has been paid in recent years to multi-stage designs, in which the user alternates between data collection and analysis and thereby sequentially reduces the size of the problem. However, the focus has generally been on designs that require a hypothesis to be tested in every single stage before it can be declared rejected by the procedure. Such procedures are well suited for homogeneous effects, i.e. effects of (almost) equal sizes; with effects of varying size, however, a procedure that permits rejection at interim stages is much more suitable. Here we present precisely such a multi-stage testing procedure, called Robin Hood. We show that with heterogeneous effects our method substantially improves on existing multi-stage procedures, with an essentially zero efficiency trade-off in the homogeneous-effect regime, which makes it especially useful in areas such as genetics, where heterogeneous effects are common. Our method improves on existing approaches in a number of ways, including a novel way of performing two-sided testing in a multi-stage procedure with increased power for detecting small effects.
"Robin Hood: A cost-efficient two-stage approach to large-scale simultaneous inference with non-homogeneous sparse effects." Jakub Pecanka, Jelle Goeman. Statistical Applications in Genetics and Molecular Biology 16(2): 107-132, 25 April 2017. doi:10.1515/sagmb-2016-0039
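For orientation, the simplest member of the family the abstract discusses is a two-stage screen-then-confirm pipeline: stage 1 screens all hypotheses at a liberal threshold on one batch of data, and stage 2 tests only the survivors, Bonferroni-corrected over the much smaller surviving set. This generic pipeline is not the Robin Hood procedure (which, notably, permits rejection at interim stages and reallocates error budget); it only shows the cost structure being improved upon.

```python
# Naive two-stage simultaneous testing: screen on stage-1 p-values,
# confirm survivors on independent stage-2 p-values with Bonferroni.
# Thresholds are illustrative defaults.
def two_stage(p_stage1, p_stage2, screen_alpha=0.1, final_alpha=0.05):
    survivors = [i for i, p in enumerate(p_stage1) if p <= screen_alpha]
    if not survivors:
        return []
    bonf = final_alpha / len(survivors)   # correct only over survivors
    return [i for i in survivors if p_stage2[i] <= bonf]
```

The cost saving comes from the second stage needing data only for the survivors; the drawback, which interim rejection addresses, is that a very strong effect must still "pay" for a full second stage before it can be rejected.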
Pei-Fang Su, Yu-Lin Mau, Yan Guo, Chung-I Li, Qi Liu, John D Boice, Yu Shyr
To assess the effect of chemotherapy on mitochondrial genome mutations in cancer survivors and their offspring, a study sequenced the full mitochondrial genome and determined the mitochondrial DNA (mtDNA) heteroplasmic mutation rate. To model counts of heteroplasmic mutations in mothers and their offspring, bivariate Poisson regression was used to examine the relationship between mutation count and clinical information while accounting for the paired correlation. However, if the sequencing depth is not adequate, only a limited fraction of the mtDNA will be available for variant calling. The classical bivariate Poisson regression model treats the offset term as equal within pairs and thus cannot be applied directly. In this research, we propose an extended bivariate Poisson regression model with a more general offset term that adjusts for the length of the accessible genome for each observation. We evaluate the performance of the proposed method with comprehensive simulations, and the results show that the regression model provides unbiased parameter estimates. The use of the model is also demonstrated on the paired mtDNA dataset.
"Bivariate Poisson models with varying offsets: an application to the paired mitochondrial DNA dataset." Pei-Fang Su, Yu-Lin Mau, Yan Guo, Chung-I Li, Qi Liu, John D. Boice, Yu Shyr. Statistical Applications in Genetics and Molecular Biology 16(1): 47-58, 1 March 2017. doi:10.1515/sagmb-2016-0040
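In Poisson-regression terms, the varying-offset idea can be written as follows. The notation here is ours, for orientation only; the paper's bivariate construction of the within-pair correlation is not shown.

```latex
% Observation-specific offsets in a paired Poisson mean model
% (illustrative notation; j indexes the two members of pair i).
\[
  Y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad
  \log \lambda_{ij} = \log t_{ij} + x_{ij}^{\top}\beta, \qquad j = 1, 2,
\]
\[
  \text{classical model: } t_{i1} = t_{i2}
  \qquad\text{vs.}\qquad
  \text{extended model: } t_{i1} \ne t_{i2} \text{ allowed,}
\]
```

where $t_{ij}$ is the length of the accessible mitochondrial genome for member $j$ of pair $i$. When sequencing depth differs between a mother and her offspring, forcing $t_{i1} = t_{i2}$ misstates the exposure, which is exactly the restriction the extended model removes.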
The output from the analysis of a high-throughput 'omics' experiment is very often a ranked list. One commonly encountered example is a ranked list of differentially expressed genes from a gene expression experiment, with a length of many hundreds of genes. There are numerous situations where interest lies in comparing the outputs of, say, two (or more) different experiments, or of different analysis approaches that produce different ranked lists. Rather than considering exact agreement between the rankings, following others, we consider two ranked lists to be in agreement if the rankings differ by at most some fixed distance. Generally, only a relatively small subset of the k top-ranked items will be in agreement. The aim is therefore to find the point k at which the probability of agreement in rankings changes from being greater than 0.5 to being less than 0.5. We use penalized splines and a Bayesian logit model to give a nonparametric smooth of the sequence of agreements, as well as pointwise credible intervals for the probability of agreement. Our approach produces a point estimate and a credible interval for k. R code is provided. The method is applied to rankings of genes from breast cancer microarray experiments.
"Comparison and visualisation of agreement for paired lists of rankings." Margaret R. Donald, Susan R. Wilson. Statistical Applications in Genetics and Molecular Biology 16(1): 31-45, 1 March 2017. doi:10.1515/sagmb-2016-0036
For mass spectra acquired from cancer patients by MALDI or SELDI techniques, automated discrimination between cancer types or stages has often been implemented by machine learning algorithms. Nevertheless, these techniques typically lack interpretability in terms of biomarkers. In this paper, we propose a new mass spectra discrimination algorithm based on parameterized Markov Random Fields that automatically generates interpretable classifiers built on small groups of scored biomarkers. A dataset of 238 MALDI colorectal mass spectra and two datasets of 216 and 253 SELDI ovarian mass spectra, respectively, were used to test our approach. The results show that our approach reaches accuracies of 81% to 100% in discriminating between patients at different colorectal and ovarian cancer stages, and performs as well as or better than previous studies on similar datasets. Moreover, our approach enables efficient planar displays to visualize mass spectra discrimination and has good asymptotic performance for large datasets. Thus, our classifiers should facilitate the choice and planning of further experiments for the biological interpretation of cancer-discriminating signatures. In our experiments, the number of mass spectra for each colorectal cancer stage is roughly half of that for each ovarian cancer stage, which explains why we reach lower discrimination accuracy for colorectal cancer than for ovarian cancer.
"Binary Markov Random Fields and interpretable mass spectra discrimination." Ao Kong, Robert Azencott. Statistical Applications in Genetics and Molecular Biology, published online 11 February 2017. doi:10.1515/sagmb-2016-0019
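The flavor of a binary-MRF classifier on peak indicators can be conveyed in a few lines: each spectrum is reduced to a binary vector (peak present or absent), and a score combines per-peak weights with pairwise couplings between peaks; the sign of the score gives the class. All weights below are invented for illustration; in the paper they are learned, and the small set of scored peaks is what makes the classifier interpretable as a biomarker group.

```python
# Score a binary peak-indicator vector x with unary weights (one per
# peak) and pairwise couplings (one per selected peak pair); classify
# by the sign of the score. Hypothetical weights, for illustration only.
def mrf_score(x, unary, pairwise):
    s = sum(w * xi for w, xi in zip(unary, x))
    s += sum(w * x[i] * x[j] for (i, j), w in pairwise.items())
    return s

unary = [1.5, -2.0, 0.5]          # per-peak evidence weights
pairwise = {(0, 2): 1.0}          # co-occurrence of peaks 0 and 2
spectrum = [1, 0, 1]              # peaks 0 and 2 present, peak 1 absent
label = 1 if mrf_score(spectrum, unary, pairwise) > 0 else -1
```

Because only a handful of peaks carry nonzero weights, the fitted scores can be read directly as a small, ranked biomarker panel, unlike a black-box classifier over the full spectrum.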
Katherine L Thompson, Catherine R Linnen, Laura Kubatko
A central goal in the biological and biomedical sciences is to identify the molecular basis of variation in morphological and behavioral traits. Over the last decade, improvements in sequencing technologies coupled with the active development of association mapping methods have made it possible to link single nucleotide polymorphisms (SNPs) and quantitative traits. However, a major limitation of existing methods is that they are often unable to consider complex, but biologically realistic, scenarios. Previous work showed that association mapping performance can be improved by using the evolutionary history within each SNP to estimate the covariance structure among randomly sampled individuals. Here, we propose a method that can analyze a variety of data types, such as data including external covariates, while considering the evolutionary history among SNPs; existing methods either do so only at a high computational cost or fail to model these relationships altogether. By considering the broad-scale relationships among SNPs, the proposed approach is both computationally feasible and informed by the evolutionary history among SNPs. We show that incorporating an approximate covariance structure during the analysis of complex data sets increases performance in quantitative trait mapping, and we apply the proposed method to deer mice data.
"Tree-based quantitative trait mapping in the presence of external covariates." Katherine L. Thompson, Catherine R. Linnen, Laura Kubatko. Statistical Applications in Genetics and Molecular Biology 15(6): 473-490, 1 December 2016. doi:10.1515/sagmb-2015-0107
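The role of the covariance structure can be seen in the standard generalized-least-squares estimator (textbook notation, not specific to this paper):

```latex
% GLS with a relatedness covariance V among the sampled individuals.
\[
  y = X\beta + \epsilon, \qquad \epsilon \sim N(0, \sigma^{2} V), \qquad
  \hat{\beta} = (X^{\top} V^{-1} X)^{-1} X^{\top} V^{-1} y,
\]
```

where $y$ is the trait, $X$ holds the SNP genotype and any external covariates, and $V$ encodes relatedness among individuals. In the spirit of the abstract, a single broad-scale $V$ informed by the SNPs' shared evolutionary history replaces a per-SNP covariance estimate, which is what keeps the approach computationally feasible.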
We propose a novel systematic procedure of non-linear data transformation for an adaptive algorithm in the context of network reverse-engineering using information theoretic methods. Our methodology is rooted in elucidating and correcting for the specific biases in the estimation techniques for mutual information (MI) given a finite sample of data. These are, in turn, tied to the lack of well-defined bounds for the numerical estimation of MI for continuous probability distributions from finite data. The nature and properties of the inevitable bias are described and complemented by several examples illustrating its form and variation. We propose an adaptive partitioning scheme for MI estimation that effectively transforms the sample data using parameters determined from its local and global distribution, guaranteeing a more robust and reliable reconstruction algorithm. Together with a normalized measure (the Shared Information Metric), we report considerably enhanced performance for both in silico and real-world biological networks. We also find that the recovery of true interactions is particularly good in the intermediate range of false-positive rates, suggesting that our algorithm is less vulnerable to spurious signals of association.
"Adaptive input data transformation for improved network reconstruction with information theoretic algorithms." Venkateshan Kannan, Jesper Tegner. Statistical Applications in Genetics and Molecular Biology 15(6): 507-520, 1 December 2016. doi:10.1515/sagmb-2016-0013
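The finite-sample bias the abstract describes is easy to demonstrate: a naive histogram (plug-in) estimate of mutual information between two *independent* continuous variables comes out strictly positive, and the bias grows with the number of bins. The fixed-grid estimator below is the naive baseline that motivates adaptive partitioning; it is not the authors' scheme.

```python
import math
import random

# Plug-in MI from a fixed B-by-B grid over [0, 1)^2: MI of the
# empirical joint distribution, which is non-negative by construction.
def plugin_mi(xs, ys, bins):
    n = len(xs)
    def bin_of(v):                     # values assumed in [0, 1)
        return min(int(v * bins), bins - 1)
    joint, px, py = {}, {}, {}
    for x, y in zip(xs, ys):
        i, j = bin_of(x), bin_of(y)
        joint[(i, j)] = joint.get((i, j), 0) + 1
        px[i] = px.get(i, 0) + 1
        py[j] = py.get(j, 0) + 1
    mi = 0.0
    for (i, j), c in joint.items():
        pij = c / n
        mi += pij * math.log(pij / ((px[i] / n) * (py[j] / n)))
    return mi

rng = random.Random(1)
xs = [rng.random() for _ in range(500)]
ys = [rng.random() for _ in range(500)]   # independent of xs: true MI = 0
coarse = plugin_mi(xs, ys, 4)             # mild upward bias
fine = plugin_mi(xs, ys, 32)              # much larger bias: many near-empty cells
```

The true MI is zero in both cases, yet the fine grid reports a substantially larger value; transforming the data so that the partition adapts to its local and global distribution is precisely the correction the paper develops.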