Statistical Applications in Genetics and Molecular Biology: Latest Articles (Page 3)

Comparisons of classification methods for viral genomes and protein families using alignment-free vectorization.

IF 0.9 | Zone 4 (Mathematics) | Q4 Biochemistry & Molecular Biology

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2018-06-30 DOI: 10.1515/sagmb-2018-0004

Hsin-Hsiung Huang, Shuai Hao, Saul Alarcon, Jie Yang

Abstract: In this paper, we propose a statistical classification method based on discriminant analysis that uses the first and second moments of the positions of each nucleotide in the genome sequences as features, and we compare its performance with that of other classification methods as well as with the natural vector method for comparative genomic analysis. We examine the normality of the proposed features. The statistical classification models used include linear discriminant analysis, quadratic discriminant analysis, diagonal linear discriminant analysis, the k-nearest-neighbor classifier, logistic regression, support vector machines, and classification trees. All these classifiers are tested on a viral genome dataset and a protein dataset for predicting viral Baltimore labels, viral family labels, and protein family labels.
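
The position-moment features mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact normalization, so the choices here (1-based positions, the plain mean as first moment, the population variance as second moment, zeros for absent letters) and the function name `position_moment_features` are assumptions made only for illustration.

```python
from collections import defaultdict


def position_moment_features(seq, alphabet="ACGT"):
    """Return [mean position, variance of positions] for each letter in `alphabet`.

    Assumption: positions are 1-based and the second moment is the population
    variance; the paper may use a different normalization.
    """
    positions = defaultdict(list)
    for i, base in enumerate(seq.upper(), start=1):
        positions[base].append(i)

    features = []
    for base in alphabet:
        pos = positions.get(base, [])
        if not pos:
            # Letter absent from the sequence: use zeros as a placeholder.
            features.extend([0.0, 0.0])
            continue
        mean = sum(pos) / len(pos)
        var = sum((p - mean) ** 2 for p in pos) / len(pos)
        features.extend([mean, var])
    return features


print(position_moment_features("ACGTAC"))
# → [3.0, 4.0, 4.0, 4.0, 3.0, 0.0, 4.0, 0.0]
```

Each sequence maps to a fixed-length vector regardless of its length, which is what makes such features usable by standard classifiers without an alignment step.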

Citations: 4

A statistical method for measuring activation of gene regulatory networks.

IF 0.9 | Zone 4 (Mathematics) | Q4 Biochemistry & Molecular Biology

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2018-06-13 DOI: 10.1515/sagmb-2016-0059

Gustavo H Esteves, Luiz F L Reis

Motivation: Gene expression data analysis is of great importance for modern molecular biology, given our ability to measure the expression profiles of thousands of genes, which enables studies rooted in systems biology. In this work, we propose a simple statistical model for measuring the activation of gene regulatory networks, instead of the traditional gene co-expression networks.

Results: We present the mathematical construction of a statistical procedure for testing hypotheses regarding gene regulatory network activation. The real probability distribution of the test statistic is evaluated by a permutation-based study. To illustrate the functionality of the proposed methodology, we also present a simple example based on a small hypothetical network and the activation measurement of two KEGG networks, both based on gene expression data collected from gastric and esophageal samples. The two KEGG networks were also analyzed using a public dataset, available through NCBI-GEO, presented as Supplementary Material.

Availability: This method was implemented in an R package that is available at the Bioconductor project website under the name maigesPack.
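
The permutation-based evaluation of a test statistic mentioned in the Results can be sketched generically. The abstract does not define the network activation statistic, so the statistic used here (absolute difference of group means) is a stand-in, and the function name `permutation_p_value` and its parameters are hypothetical. This is not the maigesPack implementation.

```python
import random


def permutation_p_value(group_a, group_b, n_perm=999, seed=0):
    """Monte Carlo permutation p-value for a two-group statistic.

    Assumption: the statistic is |mean(a) - mean(b)|, a placeholder for the
    paper's network activation statistic.
    """
    rng = random.Random(seed)

    def stat(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = stat(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the samples
        if stat(pooled[:n_a], pooled[n_a:]) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)


p = permutation_p_value([10, 11, 12, 13, 14], [0, 1, 2, 3, 4])
print(p)
```

The null distribution is built from the data themselves, which is why no parametric form for the statistic is needed.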


Volume 17, Issue 3

Citations: 1

Multi-locus data distinguishes between population growth and multiple merger coalescents.

IF 0.9 | Zone 4 (Mathematics) | Q4 Biochemistry & Molecular Biology

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2018-06-13 DOI: 10.1515/sagmb-2017-0011

Jere Koskela

We introduce a low-dimensional function of the site frequency spectrum that is tailor-made for distinguishing coalescent models with multiple mergers from Kingman coalescent models with population growth, and use this function to construct a hypothesis test between these model classes. The null and alternative sampling distributions of the statistic are intractable, but its low dimensionality renders them amenable to Monte Carlo estimation. We construct kernel density estimates of the sampling distributions based on simulated data, and show that the resulting hypothesis test dramatically improves on the statistical power of a current state-of-the-art method. A key reason for this improvement is the use of multi-locus data, in particular averaging observed site frequency spectra across unlinked loci to reduce sampling variance. We also demonstrate the robustness of our method to nuisance and tuning parameters. Finally, we show that the same kernel density estimates can be used to conduct parameter estimation, and argue that our method is readily generalisable for applications in model selection, parameter inference and experimental design.
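
The multi-locus averaging step described in this abstract can be sketched as follows. The abstract does not say whether spectra are normalized before or after averaging, so this sketch averages raw counts across loci and then normalizes; that ordering and the function name `average_sfs` are assumptions.

```python
def average_sfs(loci_sfs):
    """Average site frequency spectra across loci, then normalize to sum to 1.

    `loci_sfs` is a list of equal-length count vectors, one per unlinked locus.
    Assumption: averaging is done on raw counts before normalization.
    """
    n_classes = len(loci_sfs[0])
    if any(len(sfs) != n_classes for sfs in loci_sfs):
        raise ValueError("all loci must have spectra of the same length")

    mean_counts = [
        sum(sfs[i] for sfs in loci_sfs) / len(loci_sfs) for i in range(n_classes)
    ]
    total = sum(mean_counts)
    if total == 0:
        raise ValueError("no segregating sites in any locus")
    return [c / total for c in mean_counts]


print(average_sfs([[4, 2, 1], [2, 2, 3]]))
```

Averaging over independent loci reduces the variance of each frequency class, which is the property the abstract credits for the gain in power.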


Volume 17, Issue 3

Citations: 17

Non-parametric estimation of population size changes from the site frequency spectrum.

IF 0.9 | Zone 4 (Mathematics) | Q4 Biochemistry & Molecular Biology

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2018-06-11 DOI: 10.1515/sagmb-2017-0061

Berit Lindum Waltoft, Asger Hobolth

Changes in population size are useful for understanding the evolutionary history of a species. Genetic variation within a species can be summarized by the site frequency spectrum (SFS). For a sample of size n, the SFS is a vector of length n - 1 where entry i is the number of sites where the mutant base appears i times and the ancestral base appears n - i times. We present a new method, CubSFS, for estimating the changes in population size of a panmictic population from an observed SFS. First, we provide a straightforward proof for the expression of the expected site frequency spectrum depending only on the population size. Our derivation is based on an eigenvalue decomposition of the instantaneous coalescent rate matrix. Second, we solve the inverse problem of determining the changes in population size from an observed SFS. Our solution is based on a cubic spline for the population size. The cubic spline is determined by minimizing the weighted average of two terms, namely (i) the goodness of fit to the observed SFS, and (ii) a penalty term based on the smoothness of the changes. The weight is determined by cross-validation. The new method is validated on simulated demographic histories and applied to unfolded and folded SFS from 26 different human populations from the 1000 Genomes Project.
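
The SFS definition given in this abstract translates directly into code. The sketch below computes the unfolded SFS from a small 0/1 haplotype matrix and then folds it; the input layout (one list per sampled sequence, 1 marking the mutant base) and the function names are assumptions for illustration and are not part of CubSFS.

```python
def unfolded_sfs(haplotypes):
    """Unfolded SFS of length n - 1 from n equal-length 0/1 sequences.

    Entry i (1-based) counts sites where the mutant base (coded 1) appears
    i times; monomorphic sites (0 or n copies) are ignored.
    """
    n = len(haplotypes)
    sfs = [0] * (n - 1)
    for site in zip(*haplotypes):  # iterate over columns, i.e. sites
        i = sum(site)
        if 0 < i < n:
            sfs[i - 1] += 1
    return sfs


def folded_sfs(sfs):
    """Fold an unfolded SFS by merging classes i and n - i."""
    n = len(sfs) + 1
    folded = []
    for i in range(1, n // 2 + 1):
        j = n - i
        folded.append(sfs[i - 1] if i == j else sfs[i - 1] + sfs[j - 1])
    return folded


haps = [
    [1, 1, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
print(unfolded_sfs(haps))              # → [2, 1, 2]
print(folded_sfs(unfolded_sfs(haps)))  # → [4, 1]
```

The folded version is the one to use when the ancestral base is unknown, since it only depends on minor allele counts.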


Volume 17, Issue 3

Citations: 0

Bayesian inference of selection in the Wright-Fisher diffusion model.

IF 0.9 | Zone 4 (Mathematics) | Q4 Biochemistry & Molecular Biology

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2018-06-06 DOI: 10.1515/sagmb-2017-0046

Jeffrey J Gory, Radu Herbei, Laura S Kubatko

The increasing availability of population-level allele frequency data across one or more related populations necessitates the development of methods that can efficiently estimate population genetics parameters, such as the strength of selection acting on the population(s), from such data. Existing methods for this problem in the setting of the Wright-Fisher diffusion model are primarily likelihood-based, and rely on numerical approximation for likelihood computation and on bootstrapping for assessment of variability in the resulting estimates, requiring extensive computation. Recent work has provided a method for obtaining exact samples from general Wright-Fisher diffusion processes, enabling the development of methods for Bayesian estimation in this setting. We develop and implement a Bayesian method for estimating the strength of selection based on the Wright-Fisher diffusion for data sampled at a single time point. The method utilizes the latest algorithms for exact sampling to devise a Markov chain Monte Carlo procedure to draw samples from the joint posterior distribution of the selection coefficient and the allele frequencies. We demonstrate that, when assumptions about the initial allele frequencies are accurate, the method performs well both for simulated data and for an empirical data set on hypoxia in flies, where we find evidence for strong positive selection in a previously identified region of chromosome 2L. We discuss possible extensions of our method to the more general settings commonly encountered in practice, highlighting the advantages of Bayesian approaches to inference in this setting.
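
For readers unfamiliar with the underlying model, the sketch below simulates the discrete-generation Wright-Fisher model with genic selection, whose large-population limit is the Wright-Fisher diffusion. It is not the exact diffusion sampler or the MCMC procedure from the paper; the selection parameterization (relative fitness 1 + s for the selected allele) and the name `wright_fisher_path` are assumptions.

```python
import random


def wright_fisher_path(x0, s, pop_size, generations, seed=0):
    """Simulate allele frequencies under a discrete Wright-Fisher model.

    Assumptions: haploid population of fixed size `pop_size`, selected allele
    with relative fitness 1 + s (s > -1), binomial resampling each generation.
    """
    rng = random.Random(seed)
    count = round(x0 * pop_size)
    path = [count / pop_size]
    for _ in range(generations):
        x = count / pop_size
        # Frequency after selection, before random sampling.
        p = x * (1 + s) / (x * (1 + s) + (1 - x))
        # Binomial(pop_size, p) draw using only the standard library.
        count = sum(rng.random() < p for _ in range(pop_size))
        path.append(count / pop_size)
    return path


path = wright_fisher_path(x0=0.2, s=0.05, pop_size=200, generations=50)
print(path[0], path[-1])
```

Frequencies 0 and 1 are absorbing in this model, so a path that reaches either boundary stays there.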


Volume 17, Issue 3

Citations: 2

A Bayesian hierarchical model for identifying significant polygenic effects while controlling for confounding and repeated measures.

IF 0.9 | Zone 4 (Mathematics) | Q4 Biochemistry & Molecular Biology

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2017-11-27 DOI: 10.1515/sagmb-2017-0044

Christopher McMahan, James Baurley, William Bridges, Chase Joyner, Muhamad Fitra Kacamarga, Robert Lund, Carissa Pardamean, Bens Pardamean

Genomic studies of plants often seek to identify genetic factors associated with desirable traits. The process of evaluating genetic markers one by one (i.e. a marginal analysis) may not identify important polygenic and environmental effects. Further, confounding due to growing conditions/factors and genetic similarities among plant varieties may influence conclusions. When developing new plant varieties to optimize yield or thrive in future adverse conditions (e.g. flood, drought), scientists seek a complete understanding of how the factors influence desirable traits. Motivated by a study design that measures rice yield across different seasons, fields, and plant varieties in Indonesia, we develop a regression method that identifies significant genomic factors, while simultaneously controlling for field factors and genetic similarities in the plant varieties. Our approach develops a Bayesian maximum a posteriori probability (MAP) estimator under a generalized double Pareto shrinkage prior. Through a hierarchical representation of the proposed model, a novel and computationally efficient expectation-maximization (EM) algorithm is developed for variable selection and estimation. The performance of the proposed approach is demonstrated through simulation and is used to analyze rice yields from a pilot study conducted by the Indonesian Center for Rice Research.
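
The shrinkage prior named in this abstract can be made concrete with a short sketch. It assumes the common parameterization of the generalized double Pareto density, f(β | ξ, α) = (1 / (2ξ)) (1 + |β| / (αξ))^(-(α + 1)); the abstract does not state which parameterization or hyperparameter values the authors use, so both are assumptions here, as are the function names.

```python
import math


def gdp_density(beta, xi=1.0, alpha=1.0):
    """Generalized double Pareto density with scale `xi` and shape `alpha`.

    Assumption: the standard parameterization, which may differ from the paper's.
    """
    return (1.0 / (2.0 * xi)) * (1.0 + abs(beta) / (alpha * xi)) ** (-(alpha + 1.0))


def gdp_neg_log_prior(beta, xi=1.0, alpha=1.0):
    """Negative log-density: the penalty a MAP estimator adds per coefficient."""
    return (alpha + 1.0) * math.log1p(abs(beta) / (alpha * xi)) + math.log(2.0 * xi)


for b in (0.0, 1.0, 5.0):
    print(b, gdp_density(b), gdp_neg_log_prior(b))
```

The penalty grows only logarithmically in |β|, so large effects are shrunk less than under a Laplace (lasso) prior while the spike at zero still encourages sparsity.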


Volume 16, Issue 5-6, pp. 407-419

Citations: 12

A smoothed EM-algorithm for DNA methylation profiles from sequencing-based methods in cell lines or for a single cell type.

IF 0.9 | Zone 4 (Mathematics) | Q4 Biochemistry & Molecular Biology

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2017-11-27 DOI: 10.1515/sagmb-2016-0062

Lajmi Lakhal-Chaieb, Celia M T Greenwood, Mohamed Ouhourane, Kaiqiong Zhao, Belkacem Abdous, Karim Oualkacha

We consider the assessment of DNA methylation profiles for sequencing-derived data from a single cell type or from cell lines. We derive a kernel smoothed EM-algorithm capable of analyzing an entire chromosome at once, simultaneously correcting for experimental errors arising from either the pre-treatment steps or the sequencing stage, and taking into account spatial correlations between DNA methylation profiles at neighbouring CpG sites. The outcomes of our algorithm are then used to (i) call the true methylation status at each CpG site, (ii) provide accurate smoothed estimates of DNA methylation levels, and (iii) detect differentially methylated regions. Simulations show that the proposed methodology outperforms existing analysis methods that either ignore the correlation between DNA methylation profiles at neighbouring CpG sites or do not correct for errors. The use of the proposed inference procedure is illustrated through the analysis of a publicly available data set from a cell line of induced pluripotent H9 human embryonic stem cells and also a data set where methylation measures were obtained for a small genomic region in three different immune cell types separated from whole blood.
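
The spatial smoothing ingredient of this approach can be illustrated on its own. The sketch below is a plain kernel-weighted estimate of methylation levels across neighbouring CpG sites and omits the EM error-correction step entirely; the Gaussian kernel, the read-count weighting, and the name `smooth_methylation` are assumptions rather than details taken from the paper.

```python
import math


def smooth_methylation(positions, meth_reads, total_reads, bandwidth):
    """Kernel-smoothed methylation level at each CpG position.

    Assumption: Gaussian kernel in genomic distance; each site contributes
    its methylated and total read counts weighted by the kernel.
    """
    smoothed = []
    for x0 in positions:
        num = 0.0
        den = 0.0
        for x, m, n in zip(positions, meth_reads, total_reads):
            w = math.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
            num += w * m
            den += w * n
        smoothed.append(num / den if den > 0 else float("nan"))
    return smoothed


print(smooth_methylation([100, 150, 200], [5, 10, 2], [10, 20, 4], bandwidth=50))
```

Sites with few reads borrow strength from well-covered neighbours, and the bandwidth controls how far that borrowing reaches.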


Volume 16, Issue 5-6, pp. 333-347

Citations: 6

Modifying SAMseq to account for asymmetry in the distribution of effect sizes when identifying differentially expressed genes.

IF 0.9 | Zone 4 (Mathematics) | Q4 Biochemistry & Molecular Biology

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2017-11-27 DOI: 10.1515/sagmb-2016-0037

Ekua Kotoka, Megan Orr

RNA-Seq is a developing technology for generating gene expression data by directly sequencing mRNA molecules in a sample. RNA-Seq data consist of counts of reads recorded to a particular gene that are often used to identify differentially expressed (DE) genes. A common statistical method used to analyze RNA-Seq data is Significance Analysis of Microarray with emphasis on RNA-Seq data (SAMseq). SAMseq is a nonparametric method that uses a resampling technique to account for differences in sequencing depths when identifying DE genes. We propose a modification of this method that accounts for asymmetry in the distribution of the effect sizes by considering the sign of the test statistics. Through simulation studies, we show that the proposed method, compared with the traditional SAMseq method and other existing methods, provides better power for identifying truly DE genes or controls the FDR more adequately in most settings where asymmetry is present. We illustrate the use of the proposed method by analyzing an RNA-Seq data set containing samples from the C57BL/6J (B6) and DBA/2J (D2) mouse strains.
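
One possible reading of "taking into account the sign of the test statistics" is to estimate the false discovery rate separately in the positive and negative tails. The sketch below does that with resampled null statistics; it is an illustration of the general idea only, not the authors' procedure, and the function name `tail_fdr`, its inputs, and the FDR estimator are all assumptions.

```python
def tail_fdr(observed, null_sets, cutoff, tail):
    """Estimate the FDR in one tail from resampled null statistics.

    `observed` holds one statistic per gene; `null_sets` holds one list of
    statistics per resampling. Assumption: FDR is estimated as the mean number
    of null statistics beyond the cutoff divided by the number of genes called.
    """
    if tail == "positive":
        def exceeds(t):
            return t >= cutoff
    elif tail == "negative":
        def exceeds(t):
            return t <= -cutoff
    else:
        raise ValueError("tail must be 'positive' or 'negative'")

    called = sum(exceeds(t) for t in observed)
    if called == 0:
        return 0.0
    expected_false = sum(
        sum(exceeds(t) for t in null) for null in null_sets
    ) / len(null_sets)
    return min(1.0, expected_false / called)


obs = [3.0, 2.5, 0.2, -0.5, -2.8]
nulls = [[2.6, 0.1, -0.2, 0.3, -1.0], [0.5, -0.4, 1.2, -2.9, 0.0]]
print(tail_fdr(obs, nulls, 2.0, "positive"))  # → 0.25
print(tail_fdr(obs, nulls, 2.0, "negative"))  # → 0.5
```

When most true effects share one sign, the two tails can have very different error rates at the same cutoff, which a symmetric two-sided rule would hide.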


Volume 16, Issue 5-6, pp. 291-312

Citations: 1

Approximate maximum likelihood estimation for population genetic inference.

IF 0.8 | Zone 4 (Mathematics) | Q4 Biochemistry & Molecular Biology

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2017-11-27 DOI: 10.1515/sagmb-2017-0016

Johanna Bertl, Gregory Ewing, Carolin Kosiol, Andreas Futschik

In many population genetic problems, parameter estimation is obstructed by an intractable likelihood function. Therefore, approximate estimation methods have been developed, and with growing computational power, sampling-based methods became popular. However, these methods such as Approximate Bayesian Computation (ABC) can be inefficient in high-dimensional problems. This led to the development of more sophisticated iterative estimation methods like particle filters. Here, we propose an alternative approach that is based on stochastic approximation. By moving along a simulated gradient or ascent direction, the algorithm produces a sequence of estimates that eventually converges to the maximum likelihood estimate, given a set of observed summary statistics. This strategy does not sample much from low-likelihood regions of the parameter space, and is fast, even when many summary statistics are involved. We put considerable efforts into providing tuning guidelines that improve the robustness and lead to good performance on problems with high-dimensional summary statistics and a low signal-to-noise ratio. We then investigate the performance of our resulting approach and study its properties in simulations. Finally, we re-estimate parameters describing the demographic history of Bornean and Sumatran orang-utans.
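
The idea of moving along a simulated ascent direction can be illustrated with a classical Kiefer-Wolfowitz stochastic approximation on a toy problem. In the paper the objective would be a likelihood approximated from simulated summary statistics; here it is replaced by a noisy quadratic with a known maximum at 2, and the gain sequences, the function names, and the one-dimensional setting are assumptions for the sake of a runnable example.

```python
import random


def kiefer_wolfowitz(noisy_objective, theta0, n_iter=2000, a=0.5, c=0.5):
    """Maximize a noisy objective with finite-difference stochastic approximation.

    Assumption: gains a_k = a / k and c_k = c / k**0.25, one common choice that
    satisfies the usual convergence conditions.
    """
    theta = theta0
    for k in range(1, n_iter + 1):
        a_k = a / k
        c_k = c / k ** 0.25
        # Central finite difference of two noisy evaluations.
        grad = (noisy_objective(theta + c_k) - noisy_objective(theta - c_k)) / (2 * c_k)
        theta += a_k * grad  # step in the estimated ascent direction
    return theta


rng = random.Random(1)


def toy_log_likelihood(theta):
    """Hypothetical stand-in for a simulated log-likelihood, maximized at 2."""
    return -(theta - 2.0) ** 2 + rng.gauss(0.0, 0.1)


estimate = kiefer_wolfowitz(toy_log_likelihood, theta0=-5.0)
print(round(estimate, 2))
```

Each iteration needs only two noisy evaluations near the current estimate, so little effort is spent in low-likelihood regions of the parameter space.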


Volume 16, Issue 5-6, Pages 387-405

Citations: 0

Multivariate association between single-nucleotide polymorphisms in Alzgene linkage regions and structural changes in the brain: discovery, refinement and validation.

IF 0.9 Zone 4 Mathematics Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2017-11-27 DOI: 10.1515/sagmb-2016-0077

Elena Szefer, Donghuan Lu, Farouk Nathoo, Mirza Faisal Beg, Jinko Graham

Using publicly available data from the Alzheimer's Disease Neuroimaging Initiative, we investigate the joint association between single-nucleotide polymorphisms (SNPs) in previously established linkage regions for Alzheimer's disease (AD) and rates of decline in brain structure. In an initial discovery stage of analysis, we applied a weighted RV test to assess the association between 75,845 SNPs in the Alzgene linkage regions and rates of change in structural MRI measurements for 56 brain regions affected by AD, in 632 subjects. After confirming association, we selected refined lists of 1694 and 22 SNPs via a bootstrap-enhanced sparse canonical correlation analysis. In a final validation stage, we confirmed association between the refined list of 1694 SNPs and the imaging phenotypes in an independent data set. Genes corresponding to the priority SNPs with the highest contribution in the validation data have previously been implicated, or hypothesized to be implicated, in AD, including GCLC, IDE, STAMBP1, and FAS. Though the effect sizes of the 1694 SNPs in the priority set are likely small, further investigation within this set may advance understanding of the missing heritability in AD. Our analysis addresses challenges in current imaging-genetics studies such as biased sampling designs and high-dimensional data with low association signal.
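For readers unfamiliar with the RV statistic, the following minimal sketch computes the plain (unweighted) RV coefficient between two data matrices measured on the same subjects. It is an illustration only: the abstract refers to a weighted RV test, whose weights and null distribution are not reproduced here, and the toy `genotypes` and `decline_rates` values are invented for the example.

```python
def centre_columns(rows):
    """Subtract each column's mean from a row-major matrix."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def cross_product(a, b):
    """Return a^T b for row-major matrices with the same number of rows."""
    return [[sum(x * y for x, y in zip(col_a, col_b)) for col_b in zip(*b)]
            for col_a in zip(*a)]


def sum_squares(matrix):
    """Squared Frobenius norm."""
    return sum(v * v for row in matrix for v in row)


def rv_coefficient(x, y):
    """RV = tr(Sxy Syx) / sqrt(tr(Sxx^2) tr(Syy^2)) on column-centred data."""
    xc, yc = centre_columns(x), centre_columns(y)
    numerator = sum_squares(cross_product(xc, yc))
    denominator = (sum_squares(cross_product(xc, xc))
                   * sum_squares(cross_product(yc, yc))) ** 0.5
    return numerator / denominator


# Toy data (invented): 5 subjects, 2 SNP genotype counts and 2 rates of change.
genotypes = [[0, 1], [1, 0], [2, 1], [1, 2], [0, 0]]
decline_rates = [[-0.1, 0.2], [-0.3, 0.1], [-0.6, 0.4], [-0.2, 0.5], [0.0, 0.1]]
print(rv_coefficient(genotypes, decline_rates))
```

The coefficient lies between 0 and 1 and equals 1 when one centred matrix is a rescaled copy of the other, so it acts as a multivariate analogue of a squared correlation.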


Volume 16, Issue 5-6, Pages 349-365

Citations: 10