Consistency and convergence rate of phylogenetic inference via regularization.

IF 3.7 | Zone 1, Mathematics | Q1, Statistics & Probability | Annals of Statistics | Pub Date: 2018-08-01 | Epub Date: 2018-06-27 | DOI: 10.1214/17-AOS1592

Vu Dinh, Lam Si Tung Ho, Marc A Suchard, Frederick A Matsen

Annals of Statistics, volume 46, issue 4, pages 1481-1512. Journal Article.

Citations: 8

Abstract

It is common in phylogenetics to have some, perhaps partial, information about the overall evolutionary tree of a group of organisms and wish to find an evolutionary tree of a specific gene for those organisms. There may not be enough information in the gene sequences alone to accurately reconstruct the correct "gene tree." Although the gene tree may deviate from the "species tree" due to a variety of genetic processes, in the absence of evidence to the contrary it is parsimonious to assume that they agree. A common statistical approach in these situations is to develop a likelihood penalty to incorporate such additional information. Recent studies using simulation and empirical data suggest that a likelihood penalty quantifying concordance with a species tree can significantly improve the accuracy of gene tree reconstruction compared to using sequence data alone. However, the consistency of such an approach has not yet been established, nor have convergence rates been bounded. Because phylogenetics is a non-standard inference problem, the standard theory does not apply. In this paper, we propose a penalized maximum likelihood estimator for gene tree reconstruction, where the penalty is the square of the Billera-Holmes-Vogtmann geodesic distance from the gene tree to the species tree. We prove that this method is consistent, and derive its convergence rate for estimating the discrete gene tree structure and continuous edge lengths (representing the amount of evolution that has occurred on that branch) simultaneously. We find that the regularized estimator is "adaptive fast converging," meaning that it can reconstruct all edges of length greater than any given threshold from gene sequences of polynomial length. Our method does not require the species tree to be known exactly; in fact, our asymptotic theory holds for any such guide tree.
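
The penalized estimator described in the abstract can be sketched as a formula. The notation below is assumed for illustration and is not taken from the paper: $\ell_k(T)$ is the log-likelihood of a candidate gene tree $T$ (topology and edge lengths) given $k$ sequence sites, $T_0$ is the guide (species) tree, $d_{\mathrm{BHV}}$ is the Billera-Holmes-Vogtmann geodesic distance, and $\alpha_k > 0$ is a hypothetical regularization weight. The $1/k$ scaling of the log-likelihood and the way $\alpha_k$ depends on $k$ are also assumptions.

```latex
% Sketch of a penalized maximum likelihood estimator with a squared
% BHV-distance penalty. All symbols are illustrative assumptions.
\[
  \hat{T}_k \;\in\; \operatorname*{arg\,max}_{T}
  \left\{ \frac{1}{k}\,\ell_k(T) \;-\; \alpha_k \, d_{\mathrm{BHV}}(T, T_0)^2 \right\}
\]
```

In this form, a larger $\alpha_k$ pulls the estimate toward the guide tree, while $\alpha_k = 0$ recovers ordinary maximum likelihood from the sequence data alone.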

Source journal

Annals of Statistics (Mathematics: Statistics & Probability)

CiteScore: 9.30
Self-citation rate: 8.90%
Articles published: 119
Review time: 6-12 weeks

Journal description: The Annals of Statistics aims to publish research papers of the highest quality reflecting the many facets of contemporary statistics. Primary emphasis is placed on importance and originality, not on formalism. The journal aims to cover all areas of statistics, especially mathematical statistics and applied and interdisciplinary statistics. Many of the best papers will touch on more than one of these general areas, because the discipline of statistics has deep roots in mathematics and in substantive scientific fields.