DFAST_QC: quality assessment and taxonomic identification tool for prokaryotic Genomes.

BMC Bioinformatics, vol. 26, no. 1, article 3. Pub Date: 2025-01-07. DOI: 10.1186/s12859-024-06030-y. IF 2.9; Zone 3 (Biology); JCR Q2 (Biochemical Research Methods). Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11705978/pdf/
Mohamed Elmanzalawi, Takatomo Fujisawa, Hiroshi Mori, Yasukazu Nakamura, Yasuhiro Tanizawa

Abstract

Background: Accurate taxonomic classification in genome databases is essential for reliable biological research and effective data sharing. Mislabeling or inaccuracies in genome annotations can lead to incorrect scientific conclusions and hinder the reproducibility of research findings. Despite advances in genome analysis techniques, challenges persist in ensuring precise and reliable taxonomic assignments. Existing tools for genome verification often involve extensive computational resources or lengthy processing times, which can limit their accessibility and scalability for large-scale projects. There is a need for more efficient, user-friendly solutions that can handle diverse datasets and provide accurate results with minimal computational demands. This work aimed to address these challenges by introducing a novel tool that enhances taxonomic accuracy, offers a user-friendly interface, and supports large-scale analyses.

Results: We introduce DFAST_QC, a novel tool for the quality control and taxonomic classification of prokaryotic genomes, available as both a command-line tool and a web service. DFAST_QC quickly identifies species under the NCBI and GTDB taxonomies by combining genome-distance calculations using Mash with average nucleotide identity (ANI) calculations using skani. We evaluated DFAST_QC's performance in species identification and found it to be highly consistent with existing taxonomic standards, successfully identifying species across diverse datasets. In several cases, DFAST_QC identified potential mislabeling of species names in public databases and highlighted discrepancies in current classifications, demonstrating its capability to uncover errors and enhance taxonomic accuracy. Additionally, the tool's efficient design allows it to run on local machines with minimal computational requirements, making it a practical choice for large-scale genome projects.
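The two-stage idea described above (a cheap genome-distance screen followed by ANI on the shortlisted references) can be sketched roughly as follows. This is only an illustration under stated assumptions, not DFAST_QC's actual implementation: the function name `identify_species`, the shortlist size, the 95% ANI cutoff, the reference names, and all numeric values are hypothetical, and the distances and ANI values are supplied as toy inputs rather than computed by Mash or skani.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Toy inputs, invented for illustration only (not real Mash or skani output).
TOY_MASH: Dict[str, float] = {"ref_A": 0.02, "ref_B": 0.08, "ref_C": 0.15, "ref_D": 0.30}
TOY_ANI: Dict[str, float] = {"ref_A": 97.8, "ref_B": 91.0, "ref_C": 84.5, "ref_D": 70.2}


def toy_ani(query: str, ref: str) -> float:
    """Stand-in for an ANI calculation; simply looks up a toy value."""
    return TOY_ANI[ref]


def identify_species(
    query: str,
    distances: Dict[str, float],
    ani_fn: Callable[[str, str], float],
    max_candidates: int = 10,      # assumed shortlist size
    ani_threshold: float = 95.0,   # assumed species-level cutoff
) -> Tuple[Optional[str], List[Tuple[str, float]]]:
    """Two-stage search: screen by distance, then rank the shortlist by ANI."""
    # Stage 1: keep only the references with the smallest genome distance.
    shortlist = sorted(distances, key=lambda ref: distances[ref])[:max_candidates]
    # Stage 2: run the more expensive ANI step on the shortlist only.
    hits = sorted(
        ((ref, ani_fn(query, ref)) for ref in shortlist),
        key=lambda hit: hit[1],
        reverse=True,
    )
    # Accept the best hit only if it clears the ANI threshold.
    if hits and hits[0][1] >= ani_threshold:
        return hits[0][0], hits
    return None, hits


if __name__ == "__main__":
    species, hits = identify_species("query", TOY_MASH, toy_ani, max_candidates=2)
    print(species, hits)
```

The point of the design is that the ANI step, which is the costlier one, runs only on a handful of candidates instead of the whole reference collection; a query whose best hit falls below the cutoff is left unassigned rather than forced onto the nearest reference.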

Conclusions: DFAST_QC is a reliable and efficient tool for accurate taxonomic identification and genome quality control, well-suited for large-scale genomic studies. Its compatibility with limited-resource environments, combined with its user-friendly design, ensures seamless integration into existing workflows. DFAST_QC's ability to refine species assignments in public databases highlights its value as a complementary tool for maintaining and enhancing the accuracy of taxonomic data in genomic research. The web version is available at https://dfast.ddbj.nig.ac.jp/dqc/submit/, and the source code for local use can be found at https://github.com/nigyta/dfast_qc.
