How Scalable Are Clade-Specific Marker K-Mer Based Hash Methods for Metagenomic Taxonomic Classification?

Frontiers in Signal Processing (IF 1.3; JCR Q3, Engineering, Electrical & Electronic) | Pub Date: 2022-07-05 | DOI: 10.3389/frsip.2022.842513
Melissa M. Gray, Zhengqiao Zhao, G. Rosen
Citations: 0

Abstract

Efficiently and accurately identifying which microbes are present in a biological sample is important to medicine and biology. For example, in medicine, microbe identification allows doctors to better diagnose diseases. Two questions are essential to metagenomic analysis (the analysis of a random sampling of DNA in a patient/environment sample): how to accurately identify the microbes in samples, and how to efficiently update the taxonomic classifier as new microbe genomes are sequenced and added to the reference database. To investigate how classifiers change as they train on more knowledge, we made sub-databases composed of genomes that existed in past years, which served as "snapshots in time" (1999–2020) of the NCBI reference genome database. We evaluated two classification methods, Kraken 2 and CLARK, with these snapshots using a real, experimental metagenomic sample from a human gut. This allowed us to measure how much of a real sample could be confidently classified by these methods as the database grows. Despite not knowing the ground truth, we could measure the concordance between methods, and between years of the database within each method, using a Bray-Curtis distance. We also recorded the training times of the classifiers for each snapshot. For all data for Kraken 2, we observed that as more genomes were added, more microbes from the sample were classified. CLARK had a similar trend, but in the final year this trend reversed with the microbial variation and fewer unique k-mers. Both classifiers, despite having different ways of training, generally scale linearly in training time, but Kraken 2 has a significantly lower slope in scaling to more data.
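The Bray-Curtis distance mentioned in the abstract compares two abundance profiles without needing a ground truth, which is why it suits a real gut sample. The sketch below is only an illustration of the standard formula, not the authors' code: the function name `bray_curtis`, the genus names, and the read counts are all hypothetical, and the paper's exact preprocessing (for example, how unclassified reads are handled) is not given in the abstract.

```python
from typing import Dict


def bray_curtis(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Bray-Curtis dissimilarity between two taxon -> abundance profiles.

    Returns 0.0 for identical profiles and 1.0 for profiles sharing no taxa.
    """
    taxa = set(p) | set(q)  # a taxon missing from one profile counts as 0
    diff = sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in taxa)
    total = sum(p.get(t, 0.0) + q.get(t, 0.0) for t in taxa)
    if total == 0:
        raise ValueError("both profiles are empty")
    return diff / total


# Hypothetical read counts per genus for the same sample, classified
# against two different database snapshots (made-up numbers).
profile_a = {"Bacteroides": 60, "Escherichia": 40}
profile_b = {"Bacteroides": 50, "Escherichia": 30, "Prevotella": 20}

distance = bray_curtis(profile_a, profile_b)
print(f"Bray-Curtis distance: {distance:.2f}")
```

A value near 0 would indicate that two snapshots (or two classifiers) assign the sample to nearly the same taxa in the same proportions; a value near 1 would indicate almost no overlap.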