Reference Genome Choice and Filtering Thresholds Jointly Influence Phylogenomic Analyses.

IF 6.1 | Zone 1, Biology | Q1, Evolutionary Biology | Systematic Biology | Pub Date: 2024-05-27 | DOI: 10.1093/sysbio/syad065
Jessica A Rick, Chad D Brock, Alexander L Lewanski, Jimena Golcher-Benavides, Catherine E Wagner
Systematic Biology, pp. 76-101. Publication type: Journal Article.
Citations: 0

Abstract

Molecular phylogenies are a cornerstone of modern comparative biology and are commonly employed to investigate a range of biological phenomena, such as diversification rates, patterns in trait evolution, biogeography, and community assembly. Recent work has demonstrated that significant biases may be introduced into downstream phylogenetic analyses from processing genomic data; however, it remains unclear whether there are interactions among bioinformatic parameters or biases introduced through the choice of reference genome for sequence alignment and variant calling. We address these knowledge gaps by employing a combination of simulated and empirical data sets to investigate the extent to which the choice of reference genome in upstream bioinformatic processing of genomic data influences phylogenetic inference, as well as the way that reference genome choice interacts with bioinformatic filtering choices and phylogenetic inference method. We demonstrate that more stringent minor allele filters bias inferred trees away from the true species tree topology, and that these biased trees tend to be more imbalanced and have a higher center of gravity than the true trees. We find the greatest topological accuracy when filtering sites for minor allele count (MAC) >3-4 in our 51-taxa data sets, while tree center of gravity was closest to the true value when filtering for sites with MAC >1-2. In contrast, filtering for missing data increased accuracy in the inferred topologies; however, this effect was small in comparison to the effect of minor allele filters and may be undesirable due to a subsequent mutation spectrum distortion. The bias introduced by these filters differs based on the reference genome used in short read alignment, providing further support that choosing a reference genome for alignment is an important bioinformatic decision with implications for downstream analyses. These results demonstrate that attributes of the study system and dataset (and their interaction) add important nuance for how best to assemble and filter short-read genomic data for phylogenetic inference.
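
The two filters discussed in the abstract, minor allele count (MAC) and missing data, can be illustrated with a minimal sketch. This is not the authors' pipeline. It assumes diploid, biallelic genotypes coded as alternate-allele counts (0, 1, 2) with `None` for a missing call, and it treats `min_mac` as an inclusive lower bound. The function names, the toy data, and the threshold values are hypothetical and chosen only for illustration.

```python
from typing import List, Optional

# Assumption: each genotype is the number of alternate alleles in a diploid
# individual (0, 1, or 2), or None when the call is missing.
Genotype = Optional[int]
Site = List[Genotype]


def minor_allele_count(site: Site) -> int:
    """Count copies of the rarer allele among the called genotypes."""
    called = [g for g in site if g is not None]
    alt_copies = sum(called)
    total_copies = 2 * len(called)  # two allele copies per diploid individual
    return min(alt_copies, total_copies - alt_copies)


def missing_fraction(site: Site) -> float:
    """Fraction of individuals with no genotype call (site must be non-empty)."""
    return sum(g is None for g in site) / len(site)


def filter_sites(sites: List[Site], min_mac: int, max_missing: float) -> List[Site]:
    """Keep sites with MAC >= min_mac and missing fraction <= max_missing."""
    return [
        s for s in sites
        if minor_allele_count(s) >= min_mac and missing_fraction(s) <= max_missing
    ]


# Toy data: four sites scored in six individuals (hypothetical values).
sites = [
    [0, 0, 0, 1, 0, 0],           # singleton: MAC = 1
    [0, 1, 1, 2, 0, None],        # MAC = 4, one missing call
    [2, 2, 2, 2, 1, 2],           # reference allele is the rare one: MAC = 1
    [None, None, None, 1, 1, 0],  # MAC = 2, but half the calls are missing
]

kept = filter_sites(sites, min_mac=2, max_missing=0.2)
print(len(kept))
```

Raising `min_mac` removes singletons and other rare variants first, which is the kind of filtering the abstract reports as biasing inferred topologies when applied too stringently. Tools such as VCFtools expose comparable thresholds, though their exact semantics should be checked against their documentation.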

Source journal: Systematic Biology (Biology: Evolutionary Biology)
CiteScore: 13.00
Self-citation rate: 7.70%
Articles published: 70
Review time: 6-12 weeks
About the journal: Systematic Biology is the bimonthly journal of the Society of Systematic Biologists. Papers for the journal are original contributions to the theory, principles, and methods of systematics as well as phylogeny, evolution, morphology, biogeography, paleontology, genetics, and the classification of all living things. A Points of View section offers a forum for discussion, while book reviews and announcements of general interest are also featured.