The impact of gene sequence alignment and gene tree estimation error on summary-based species network estimation

Meijun Gao, Wei Wang, Kevin J. Liu
Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
Published: 2022-08-07
DOI: 10.1145/3535508.3545559 (https://doi.org/10.1145/3535508.3545559)

Abstract

Thanks in part to rapid advances in next-generation sequencing technologies, recent phylogenomic studies have demonstrated the pivotal role that non-tree-like evolution plays in many parts of the Tree of Life - the evolutionary history of all life on Earth. As such, the Tree of Life is not necessarily a tree at all, but is better described by more general graph structures such as a phylogenetic network. Another key ingredient in these advances consists of the computational methods needed for reconstructing phylogenetic networks from large-scale genomic sequence data. But virtually all of these methods either require multiple sequence alignments (MSAs) as input or utilize gene trees or other inputs that are computed using MSAs. All of the input MSAs and gene trees must be estimated on empirical data. The methods themselves do not directly account for upstream estimation error, and, apart from prior studies of phylogenetic tree reconstruction and anecdotal evidence, little is understood about the impact of estimated MSA and gene tree error on downstream species network reconstruction. We therefore undertake a performance study to quantify the impact of MSA error and gene tree error on state-of-the-art phylogenetic network inference methods. Our study utilizes synthetic benchmarking data as well as genomic sequence data from mosquito and yeast. We find that upstream MSA and gene tree estimation error can have first-order effects on the accuracy of downstream network reconstruction and, to a lesser extent, its computational runtime. The effects become more pronounced on more challenging datasets with greater evolutionary divergence and more sampled taxa. Our findings highlight an important need for computational methods development: namely, scalable methods are needed to account for estimated MSA and gene tree error when reconstructing phylogenetic networks using unaligned biomolecular sequence data.
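
The abstract does not say which error measures the study uses. As an illustration only, one common way to quantify gene tree estimation error is the Robinson-Foulds (RF) distance between a true (simulated) gene tree and its estimate. The sketch below is a minimal pure-Python version under stated assumptions: trees are written as nested tuples of taxon names rather than parsed from Newick, they are treated as unrooted, and the normalization by 2(n - 3) assumes both trees are binary. All function names and the two example trees are hypothetical and are not taken from the paper.

```python
from typing import FrozenSet, Set, Tuple, Union

# A tree is either a leaf name or a tuple of subtrees (hypothetical encoding).
Tree = Union[str, Tuple["Tree", ...]]


def leaf_set(tree: Tree) -> FrozenSet[str]:
    """Return the set of taxon names at the leaves of the tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree: Tree) -> Set[FrozenSet[str]]:
    """Return the non-trivial splits of the tree, viewed as unrooted.

    Each split is stored as the side that does NOT contain a fixed
    anchor taxon, so the same split always gets the same representation.
    """
    taxa = leaf_set(tree)
    anchor = min(taxa)
    splits: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = clade if anchor not in clade else taxa - clade
        # Skip trivial splits (a single leaf versus everything else).
        if 1 < len(side) < len(taxa) - 1:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def rf_distance(a: Tree, b: Tree) -> int:
    """Count the splits found in exactly one of the two trees."""
    if leaf_set(a) != leaf_set(b):
        raise ValueError("trees must be on the same taxon set")
    return len(bipartitions(a) ^ bipartitions(b))


def normalized_rf(a: Tree, b: Tree) -> float:
    """RF distance divided by 2(n - 3), its maximum for binary trees."""
    max_rf = 2 * (len(leaf_set(a)) - 3)
    return rf_distance(a, b) / max_rf if max_rf > 0 else 0.0


# Hypothetical five-taxon example: the estimate misplaces taxa B and C.
true_tree: Tree = ((("A", "B"), "C"), ("D", "E"))
estimated_tree: Tree = ((("A", "C"), "B"), ("D", "E"))

print(rf_distance(true_tree, estimated_tree))    # 2
print(normalized_rf(true_tree, estimated_tree))  # 0.5
```

A normalized value of 0 means the estimated gene tree has exactly the same splits as the true tree, and 1 means two binary trees share no non-trivial split. In a simulation study of the kind described above, such a score could be averaged over all loci to summarize gene tree error before the trees are passed to a summary-based network method.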