Impact of database choice and confidence score on the performance of taxonomic classification using Kraken2

aBIOTECH, vol. 5, no. 4, pp. 465-475 · Pub Date: 2024-07-31 · DOI: 10.1007/s42994-024-00178-0 · IF 4.6 · Zone 4 (Agricultural and Forestry Sciences) · JCR Q1 (Biotechnology & Applied Microbiology)
Yunlong Liu, Morteza H. Ghaffari, Tao Ma, Yan Tu
Citations: 0

Abstract

Accurate taxonomic classification is essential to understanding microbial diversity and function through metagenomic sequencing. However, this task is complicated by the vast variety of microbial genomes and the computational limitations of bioinformatics tools. The aim of this study was to evaluate the impact of reference database selection and confidence score (CS) settings on the performance of Kraken2, a widely used k-mer-based metagenomic classifier. We generated simulated metagenomic datasets to systematically evaluate how the choice of reference database, from the compact Minikraken v1 to the expansive nt and GTDB r202, and the CS setting (from 0 to 1.0) affect the key performance metrics of Kraken2. These metrics include classification rate, precision, recall, F1 score, and the accuracy of calculated versus true bacterial abundance. Our results show that a higher CS, which increases the rigor of taxonomic classification by requiring greater k-mer agreement, generally decreases the classification rate. This effect is particularly pronounced for smaller databases such as Minikraken and Standard-16, where no reads could be classified when the CS was above 0.4. In contrast, for larger databases such as Standard, nt and GTDB r202, precision and F1 scores improved significantly with increasing CS, highlighting their robustness to stringent conditions. Recovery rates were mostly stable, indicating consistent detection of species under different CS settings. Crucially, the results show that a comprehensive reference database combined with a moderate CS (0.2 or 0.4) significantly improves classification accuracy and sensitivity. This finding underscores the need for careful selection of database and CS parameters tailored to specific scientific questions and available computational resources to optimize the results of metagenomic analyses.
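To make the role of the confidence score concrete, the following is a minimal Python sketch, not the authors' code and not Kraken2's implementation. It assumes the rule described in the Kraken2 manual: a label's score is the fraction of a read's unambiguous k-mers that map inside the clade rooted at that label, and a label scoring below the threshold is moved to its parent until the threshold is met or the read becomes unclassified. The toy taxonomy, the taxon IDs, and the function names are hypothetical. The precision, recall and F1 helper uses the standard definitions; the paper may compute these at a specific taxonomic rank.

```python
# Toy taxonomy (hypothetical IDs): child -> parent; the root has parent None.
# 1 = root, 2 = a genus, 3 and 4 = two species in that genus.
PARENT = {1: None, 2: 1, 3: 2, 4: 2}


def in_clade(taxon, clade_root):
    """Return True if `taxon` lies in the clade rooted at `clade_root`."""
    while taxon is not None:
        if taxon == clade_root:
            return True
        taxon = PARENT[taxon]
    return False


def apply_confidence(label, kmer_hits, threshold):
    """Move `label` up the tree until its clade score reaches `threshold`.

    `kmer_hits` maps taxon ID -> number of k-mers assigned to that taxon.
    Key 0 counts unambiguous k-mers with no database hit (an assumption of
    this sketch). Returns the final label, or None if the read ends up
    unclassified.
    """
    total = sum(kmer_hits.values())
    while label is not None:
        # k-mers supporting the label: those mapped anywhere in its clade.
        support = sum(
            n for t, n in kmer_hits.items() if t != 0 and in_clade(t, label)
        )
        if support / total >= threshold:
            return label
        label = PARENT[label]
    return None


def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from TP, FP and FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# One read with 10 unambiguous k-mers: 6 hit species 3, 2 hit species 4,
# and 2 have no hit. Its initial label is species 3.
hits = {3: 6, 4: 2, 0: 2}
for cs in (0.2, 0.7, 0.9):
    print(cs, apply_confidence(3, hits, cs))
# 0.2 -> 3 (species kept), 0.7 -> 2 (pushed to genus), 0.9 -> None
```

In this toy read, raising the threshold first coarsens the label and then leaves the read unclassified, which matches the abstract's observation that a higher CS lowers the classification rate. In Kraken2 itself the threshold is set with the `--confidence` option.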
