Disambiguating a Soft Metagenomic Clustering
Authors: Rahul Nihalani, Jaroslaw Zola, Srinivas Aluru
Journal: Journal of Computational Biology
Publication date: 2025-03-07
DOI: https://doi.org/10.1089/cmb.2024.0825
Abstract
Clustering is a popular technique for analyzing amplicon sequencing data in metagenomics. Specifically, it is used to assign sequences (reads) to clusters, each cluster representing a species or a higher-level taxonomic unit. Reads from multiple species often share subsequences, and no perfect similarity measure exists; together, these make it difficult to assign reads to clusters correctly. Thus, metagenomic clustering methods must either resort to ambiguity or make the best available choice at each read assignment stage, which could lead to incorrect clusters and potentially cascading errors. In this article, we argue for first generating an ambiguous clustering and then resolving the ambiguities collectively by analyzing the ambiguous clusters. We propose a rigorous formulation of this problem and show that it is NP-hard. We then propose an efficient heuristic to solve it in practice. We validate our approach on several synthetically generated datasets and two datasets consisting of 16S rDNA sequences from the microbiome of rat guts.
About the journal:
Journal of Computational Biology is the leading peer-reviewed journal in computational biology and bioinformatics, publishing in-depth statistical, mathematical, and computational analysis of methods, as well as their practical impact. Available only online, this is an essential journal for scientists and students who want to keep abreast of developments in bioinformatics.
Journal of Computational Biology coverage includes:
-Genomics
-Mathematical modeling and simulation
-Distributed and parallel biological computing
-Designing biological databases
-Pattern matching and pattern detection
-Linking disparate databases and data
-New tools for computational biology
-Relational and object-oriented database technology for bioinformatics
-Biological expert system design and use
-Reasoning by analogy, hypothesis formation, and testing by machine
-Management of biological databases