Advances in Estimating Level-1 Phylogenetic Networks from Unrooted SNPs
Authors: Tandy Warnow, Yasamin Tabatabaee, Steven N Evans
Journal: Journal of Computational Biology
Publication date: 2024-11-25
Publication type: Journal Article
DOI: 10.1089/cmb.2024.0710 (https://doi.org/10.1089/cmb.2024.0710)
Abstract
We address the problem of estimating a phylogenetic network from single-nucleotide polymorphisms (SNPs), i.e., bi-allelic markers that have evolved under the infinite sites assumption. We focus on level-1 phylogenetic networks (i.e., networks in which the cycles are node-disjoint), since more complex networks are unidentifiable. We provide a polynomial-time quartet-based method that we prove correct for reconstructing the semi-directed level-1 phylogenetic network N when we are given a set of SNPs that covers all the bipartitions of N, even if the ancestral state is not known, provided that all cycles have length at least five. We also prove that an algorithm developed by Dan Gusfield, published in the Journal of Computer and System Sciences in 2005, correctly recovers semi-directed level-1 phylogenetic networks in polynomial time in this case. We present a stochastic model for DNA evolution, and we prove that the two methods (our quartet-based method and Gusfield's method) are statistically consistent estimators of the semi-directed level-1 phylogenetic network. For multi-state homoplasy-free characters, we prove that our quartet-based method correctly constructs semi-directed level-1 networks under the same condition (all cycles of length at least five), whereas Gusfield's algorithm cannot be used in that case. These results assume access to an oracle that indicates which sites in the DNA alignment are homoplasy-free, and we show that the methods are robust, under some conditions, to oracle errors.
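As a rough illustration of the input the abstract describes, the sketch below shows how a single bi-allelic site induces an unrooted bipartition of the taxa, and which quartets that bipartition displays. It is not the authors' reconstruction method or Gusfield's algorithm; the function names `snp_bipartition` and `induced_quartets`, the taxon labels, and the dictionary encoding of a site are all assumptions made for this example.

```python
from itertools import combinations


def snp_bipartition(column):
    """Return the unrooted bipartition induced by one bi-allelic site.

    `column` maps each taxon name to its allele (hypothetical encoding).
    The result is an unordered pair of taxon sets, so it does not depend
    on which allele is ancestral.
    """
    alleles = sorted(set(column.values()))
    if len(alleles) != 2:
        raise ValueError("expected exactly two alleles at a bi-allelic site")
    side_a = frozenset(t for t, s in column.items() if s == alleles[0])
    side_b = frozenset(t for t, s in column.items() if s == alleles[1])
    return frozenset({side_a, side_b})


def induced_quartets(bipartition):
    """List every quartet ab|cd with a, b on one side and c, d on the other."""
    side_a, side_b = bipartition
    quartets = set()
    for pair_a in combinations(sorted(side_a), 2):
        for pair_b in combinations(sorted(side_b), 2):
            quartets.add(frozenset({frozenset(pair_a), frozenset(pair_b)}))
    return quartets


if __name__ == "__main__":
    site = {"a": "A", "b": "A", "c": "G", "d": "G", "e": "G"}
    split = snp_bipartition(site)
    for quartet in induced_quartets(split):
        left, right = (sorted(pair) for pair in quartet)
        print("".join(left) + "|" + "".join(right))
```

Because the bipartition is stored as an unordered pair of sets, swapping the two alleles at a site yields the same object, which mirrors the "ancestral state is not known" setting. A site where one allele appears in fewer than two taxa induces no quartets.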
About the journal:
Journal of Computational Biology is the leading peer-reviewed journal in computational biology and bioinformatics, publishing in-depth statistical, mathematical, and computational analysis of methods, as well as their practical impact. Available only online, this is an essential journal for scientists and students who want to keep abreast of developments in bioinformatics.
Journal of Computational Biology coverage includes:
- Genomics
- Mathematical modeling and simulation
- Distributed and parallel biological computing
- Designing biological databases
- Pattern matching and pattern detection
- Linking disparate databases and data
- New tools for computational biology
- Relational and object-oriented database technology for bioinformatics
- Biological expert system design and use
- Reasoning by analogy, hypothesis formation, and testing by machine
- Management of biological databases