The impact of gene sequence alignment and gene tree estimation error on summary-based species network estimation
Meijun Gao, Wei Wang, Kevin J. Liu
Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
Published: 2022-08-07
DOI: 10.1145/3535508.3545559 (https://doi.org/10.1145/3535508.3545559)
Citation count: 0
Abstract
Thanks in part to rapid advances in next-generation sequencing technologies, recent phylogenomic studies have demonstrated the pivotal role that non-tree-like evolution plays in many parts of the Tree of Life, the evolutionary history of all life on Earth. As such, the Tree of Life is not necessarily a tree at all, but is better described by more general graph structures such as phylogenetic networks. Another key ingredient in these advances is the set of computational methods needed to reconstruct phylogenetic networks from large-scale genomic sequence data. However, virtually all of these methods either require multiple sequence alignments (MSAs) as input or use gene trees or other inputs that are computed from MSAs. On empirical data, all of the input MSAs and gene trees must be estimated. The methods themselves do not directly account for upstream estimation error, and, apart from prior studies of phylogenetic tree reconstruction and anecdotal evidence, little is understood about the impact of MSA and gene tree estimation error on downstream species network reconstruction. We therefore undertake a performance study to quantify the impact of MSA error and gene tree error on state-of-the-art phylogenetic network inference methods. Our study uses synthetic benchmarking data as well as genomic sequence data from mosquito and yeast. We find that upstream MSA and gene tree estimation error can have first-order effects on the accuracy of downstream network reconstruction and, to a lesser extent, on its computational runtime. The effects become more pronounced on more challenging datasets with greater evolutionary divergence and more sampled taxa. Our findings highlight an important need in computational methods development: scalable methods are needed that account for MSA and gene tree estimation error when reconstructing phylogenetic networks from unaligned biomolecular sequence data.