Advancing microbial diagnostics: a universal phylogeny guided computational algorithm to find unique sequences for precise microorganism detection.

Briefings in Bioinformatics · Impact factor: 7.7 · Zone 2 (Biology) · JCR Q1 (Biochemical Research Methods) · Pub Date: 2024-09-23 · DOI: 10.1093/bib/bbae545

Gulshan Kumar Sharma, Rakesh Sharma, Kavita Joshi, Sameer Qureshi, Shubhita Mathur, Sharad Sinha, Samit Chatterjee, Vandana Nunia



Abstract

Sequences from organisms that share an evolutionary origin are similar, whereas unique sequences, absent from related organisms, are good candidates for diagnostic markers. However, identifying dissimilar regions among closely related organisms is challenging because it requires complex multiple sequence alignments, which are difficult to compute and parse. To address this, we developed NAUniSeq, a biologically inspired, universal algorithm that finds unique sequences for microorganism diagnosis by traversing the phylogeny of life. Mapping through a phylogenetic tree keeps cross-contamination and false positives low. We downloaded the complete taxonomy data from the Taxadb database and sequence data from the National Center for Biotechnology Information Reference Sequence Database (NCBI RefSeq) and built a phylogenetic tree with NetworkX. Sequences were assigned to the graph nodes, k-mers were generated for the target and non-target nodes, and the graph was searched with the depth-first search algorithm. In a memory-efficient alternative NoSQL approach, we created a collection of RefSeq sequences in a MongoDB database, using the tax-id and the path of the FASTA files, and queried that collection for the target and non-target sequences. In both approaches, we used an alignment-free, sliding-window, k-mer-based procedure that quickly compares the k-mers of target and non-target sequences and returns the unique sequences that are not present in the non-targets. We validated the algorithm with the target nodes Mycobacterium tuberculosis, Neisseria gonorrhoeae, and Monkeypox, and generated unique sequences. This universal algorithm is a powerful tool for generating diagnostic sequences, enabling the accurate identification of microbial strains with high phylogenetic precision.
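The alignment-free step described in the abstract can be illustrated with a small sketch. It is not the NAUniSeq implementation: a plain dictionary stands in for the NetworkX graph, an in-memory dictionary stands in for the RefSeq sequences (or the MongoDB collection), and the names `kmers`, `descendants`, `unique_kmers`, `tree`, and `sequences` are hypothetical. Reverse complements and ambiguous bases are ignored for brevity.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return every length-k substring produced by a sliding window of step 1."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def descendants(tree: Dict[str, List[str]], root: str) -> List[str]:
    """Iterative depth-first traversal: the root followed by every node below it."""
    stack, visited = [root], []
    while stack:
        node = stack.pop()
        visited.append(node)
        # Push children in reverse so they are visited in their listed order.
        stack.extend(reversed(tree.get(node, [])))
    return visited


def unique_kmers(
    tree: Dict[str, List[str]],
    sequences: Dict[str, List[str]],
    target: str,
    scope: str,
    k: int,
) -> List[str]:
    """K-mers found under `target` but in no other node under `scope`."""
    target_nodes = descendants(tree, target)
    target_set = set(target_nodes)
    non_target_nodes = [n for n in descendants(tree, scope) if n not in target_set]

    # Pool the k-mers of every non-target sequence into one background set.
    background: Set[str] = set()
    for node in non_target_nodes:
        for seq in sequences.get(node, []):
            background |= kmers(seq, k)

    # Slide over the target sequences and keep k-mers absent from the background.
    # A dict preserves first-seen order while removing duplicates.
    hits: Dict[str, None] = {}
    for node in target_nodes:
        for seq in sequences.get(node, []):
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                if kmer not in background:
                    hits[kmer] = None
    return list(hits)


# Toy data (invented for illustration): one parent node with two child taxa.
tree = {"genus": ["species_a", "species_b"], "species_a": [], "species_b": []}
sequences = {"species_a": ["ACGTACGGA"], "species_b": ["ACGTACGTT"]}

print(unique_kmers(tree, sequences, target="species_a", scope="genus", k=4))
# prints ['ACGG', 'CGGA']
```

Holding the background k-mers in a set makes each membership check constant time on average, which is why no alignment is needed. The cost is memory, and that is the pressure the abstract's MongoDB variant is meant to relieve.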


Source journal

Briefings in Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 13.20
Self-citation rate: 13.70%
Articles published: 549
Review time: 6 months

About the journal: Briefings in Bioinformatics is an international journal serving as a platform for researchers and educators in the life sciences. It also appeals to mathematicians, statisticians, and computer scientists applying their expertise to biological challenges. The journal focuses on reviews tailored for users of databases and analytical tools in contemporary genetics, molecular and systems biology. It stands out by offering practical assistance and guidance to non-specialists in computerized methodologies. Covering a wide range from introductory concepts to specific protocols and analyses, the papers address bacterial, plant, fungal, animal, and human data. The journal's detailed subject areas include genetic studies of phenotypes and genotypes, mapping, DNA sequencing, expression profiling, gene expression studies, microarrays, alignment methods, protein profiles and HMMs, lipids, metabolic and signaling pathways, structure determination and function prediction, phylogenetic studies, and education and training.