Advancing microbial diagnostics: a universal phylogeny guided computational algorithm to find unique sequences for precise microorganism detection.

IF 6.8 2区 生物学 Q1 BIOCHEMICAL RESEARCH METHODS Briefings in bioinformatics Pub Date : 2024-09-23 DOI:10.1093/bib/bbae545
Gulshan Kumar Sharma, Rakesh Sharma, Kavita Joshi, Sameer Qureshi, Shubhita Mathur, Sharad Sinha, Samit Chatterjee, Vandana Nunia
{"title":"Advancing microbial diagnostics: a universal phylogeny guided computational algorithm to find unique sequences for precise microorganism detection.","authors":"Gulshan Kumar Sharma, Rakesh Sharma, Kavita Joshi, Sameer Qureshi, Shubhita Mathur, Sharad Sinha, Samit Chatterjee, Vandana Nunia","doi":"10.1093/bib/bbae545","DOIUrl":null,"url":null,"abstract":"<p><p>Sequences derived from organisms sharing common evolutionary origins exhibit similarity, while unique sequences, absent in related organisms, act as good diagnostic marker candidates. However, the approach focused on identifying dissimilar regions among closely-related organisms poses challenges as it requires complex multiple sequence alignments, making computation and parsing difficult. To address this, we have developed a biologically inspired universal NAUniSeq algorithm to find the unique sequences for microorganism diagnosis by traveling through the phylogeny of life. Mapping through a phylogenetic tree ensures a low number of cross-contamination and false positives. We have downloaded complete taxonomy data from Taxadb database and sequence data from National Center for Biotechnology Information Reference Sequence Database (NCBI-Refseq) and, with the help of NetworkX, created a phylogenetic tree. Sequences were assigned over the graph nodes, k-mers were created for target and non-target nodes and search was performed over the graph using the depth first search algorithm. In a memory efficient alternative NoSQL approach, we created a collection of Refseq sequences in MongoDB database using tax-id and path of FASTA files. We queried the MongoDB collection for the target and non-target sequences. In both the approaches, we used an alignment free sliding window k-mer-based procedure that quickly compares k-mers of target and non-target sequences and returns unique sequences that are not present in the non-target. We have validated our algorithm with target nodes Mycobacterium tuberculosis, Neisseria gonorrhoeae, and Monkeypox and generated unique sequences. This universal algorithm is a powerful tool for generating diagnostic sequences, enabling the accurate identification of microbial strains with high phylogenetic precision.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8000,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11497845/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Briefings in bioinformatics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/bib/bbae545","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}
引用次数: 0

Abstract

Sequences derived from organisms sharing common evolutionary origins exhibit similarity, while unique sequences, absent in related organisms, act as good diagnostic marker candidates. However, the approach focused on identifying dissimilar regions among closely-related organisms poses challenges as it requires complex multiple sequence alignments, making computation and parsing difficult. To address this, we have developed a biologically inspired universal NAUniSeq algorithm to find the unique sequences for microorganism diagnosis by traveling through the phylogeny of life. Mapping through a phylogenetic tree ensures a low number of cross-contamination and false positives. We have downloaded complete taxonomy data from Taxadb database and sequence data from National Center for Biotechnology Information Reference Sequence Database (NCBI-Refseq) and, with the help of NetworkX, created a phylogenetic tree. Sequences were assigned over the graph nodes, k-mers were created for target and non-target nodes and search was performed over the graph using the depth first search algorithm. In a memory efficient alternative NoSQL approach, we created a collection of Refseq sequences in MongoDB database using tax-id and path of FASTA files. We queried the MongoDB collection for the target and non-target sequences. In both the approaches, we used an alignment free sliding window k-mer-based procedure that quickly compares k-mers of target and non-target sequences and returns unique sequences that are not present in the non-target. We have validated our algorithm with target nodes Mycobacterium tuberculosis, Neisseria gonorrhoeae, and Monkeypox and generated unique sequences. This universal algorithm is a powerful tool for generating diagnostic sequences, enabling the accurate identification of microbial strains with high phylogenetic precision.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
推进微生物诊断:以通用系统发育为指导的计算算法,为精确检测微生物寻找独特的序列。
从具有共同进化起源的生物体中提取的序列具有相似性,而在相关生物体中不存在的独特序列则是良好的诊断标记候选物。然而,这种侧重于识别近缘生物中不相似区域的方法需要复杂的多序列比对,给计算和解析带来了困难,因此带来了挑战。为了解决这个问题,我们开发了一种受生物学启发的通用 NAUniSeq 算法,通过在生命系统发育过程中旅行,找到用于微生物诊断的独特序列。通过系统发生树进行映射可确保较低的交叉污染和假阳性率。我们从 Taxadb 数据库下载了完整的分类数据,从美国国家生物技术信息中心参考序列数据库(NCBI-Refseq)下载了序列数据,并在 NetworkX 的帮助下创建了系统发生树。在图节点上分配序列,为目标和非目标节点创建 k-mers,并使用深度优先搜索算法在图上进行搜索。在一种内存高效的替代 NoSQL 方法中,我们使用 FASTA 文件的税号和路径在 MongoDB 数据库中创建了 Refseq 序列集合。我们在 MongoDB 数据库中查询目标和非目标序列。在这两种方法中,我们都使用了基于无配对滑动窗口 k-mer的程序,该程序可快速比较目标序列和非目标序列的 k-mer,并返回非目标序列中不存在的唯一序列。我们用结核分枝杆菌、淋病奈瑟菌和猴痘等目标节点验证了我们的算法,并生成了独特的序列。这种通用算法是生成诊断序列的强大工具,可准确鉴定微生物菌株,并具有很高的系统发育精确度。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Briefings in bioinformatics
Briefings in bioinformatics 生物-生化研究方法
CiteScore
13.20
自引率
13.70%
发文量
549
审稿时长
6 months
期刊介绍: Briefings in Bioinformatics is an international journal serving as a platform for researchers and educators in the life sciences. It also appeals to mathematicians, statisticians, and computer scientists applying their expertise to biological challenges. The journal focuses on reviews tailored for users of databases and analytical tools in contemporary genetics, molecular and systems biology. It stands out by offering practical assistance and guidance to non-specialists in computerized methodologies. Covering a wide range from introductory concepts to specific protocols and analyses, the papers address bacterial, plant, fungal, animal, and human data. The journal's detailed subject areas include genetic studies of phenotypes and genotypes, mapping, DNA sequencing, expression profiling, gene expression studies, microarrays, alignment methods, protein profiles and HMMs, lipids, metabolic and signaling pathways, structure determination and function prediction, phylogenetic studies, and education and training.
期刊最新文献
AutoXAI4Omics: an automated explainable AI tool for omics and tabular data. MCGAE: unraveling tumor invasion through integrated multimodal spatial transcriptomics. tcrBLOSUM: an amino acid substitution matrix for sensitive alignment of distant epitope-specific TCRs. Atomistic simulations reveal impacts of missense mutations on the structure and function of SynGAP1. COFFEE: consensus single cell-type specific inference for gene regulatory networks.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1