HiFiBGC: an ensemble approach for improved biosynthetic gene cluster detection in PacBio HiFi-read metagenomes
Authors: Amit Yadav, Srikrishna Subramanian
Journal: BMC Genomics 25(1):1096, published 2024-11-16
DOI: https://doi.org/10.1186/s12864-024-10950-7
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11569603/pdf/
Impact factor: 3.5; JCR: Q2 (Biotechnology & Applied Microbiology)
Citations: 0
Abstract
Background: Microbes produce diverse bioactive natural products with applications in fields such as medicine and agriculture. In microbial genomes, these natural products are encoded by physically clustered genes known as biosynthetic gene clusters (BGCs). Advances in genome and metagenome sequencing have enabled high-throughput identification of BGCs, a promising avenue for natural product discovery. Mining BGCs from (meta)genomes with in silico tools has opened access to a vast diversity of potentially novel natural products. However, a fundamental limitation has been the difficulty of assembling complete BGCs, especially from complex metagenomes. Because short-read technologies yield fragmented assemblies, they struggle to recover complete BGCs, particularly long and repetitive ones such as nonribosomal peptide synthetase (NRPS) and polyketide synthase (PKS) clusters. Recent advances in long-read sequencing, such as the High Fidelity (HiFi) technology from PacBio, have reduced this limitation and can help retrieve BGCs that are both accurate and complete from metagenomes. Existing BGC identification approaches therefore need to be improved to make better use of HiFi data.
Results: Here, we present HiFiBGC, a command-line workflow to identify BGCs in PacBio HiFi metagenomes. HiFiBGC leverages an ensemble of assemblies from three HiFi-tailored metagenome assemblers, together with the reads not represented in these assemblies. Based on our analyses of four HiFi metagenomic datasets from four different environments, we show that HiFiBGC identifies, on average, 78% more BGCs than the top-performing single-assembler-based method. This increase stems from HiFiBGC's ensemble assembly approach, which improves recovery by 25%, and from the inclusion of mostly fragmented BGCs identified in the unmapped reads.
Conclusions: HiFiBGC is a computational workflow for identifying BGCs in long-read HiFi metagenomes, implemented primarily in Python with the Snakemake workflow manager. HiFiBGC is available on GitHub at https://github.com/ay-amityadav/HiFiBGC under the MIT license. The code for the figures and analyses presented in the manuscript is available at https://github.com/ay-amityadav/HiFiBGC_analyses.
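The two ideas in the Results, keeping the reads that no assembly represents and reporting a relative gain over a single-assembler baseline, can be illustrated with a minimal Python sketch. This is not HiFiBGC's code: the function names, the placeholder assembler labels (`assembler_a`, `assembler_b`, `assembler_c`), the read IDs, and the counts are all hypothetical and chosen only to make the arithmetic concrete.

```python
from typing import Dict, Iterable, Set


def unmapped_reads(all_reads: Iterable[str],
                   mapped_by_assembler: Dict[str, Set[str]]) -> Set[str]:
    """Return the read IDs that are mapped to none of the assemblies.

    Hypothetical helper: assumes read-to-assembly mapping has already
    been done and summarized as one set of mapped read IDs per assembler.
    """
    mapped = set().union(*mapped_by_assembler.values())
    return set(all_reads) - mapped


def percent_increase(ensemble_count: int, baseline_count: int) -> float:
    """Relative gain of the ensemble over a baseline, in percent."""
    if baseline_count <= 0:
        raise ValueError("baseline_count must be positive")
    return 100.0 * (ensemble_count - baseline_count) / baseline_count


# Toy input (hypothetical): five reads and three placeholder assemblers.
reads = ["r1", "r2", "r3", "r4", "r5"]
mapped = {
    "assembler_a": {"r1", "r2"},
    "assembler_b": {"r2", "r3"},
    "assembler_c": {"r1"},
}

leftover = unmapped_reads(reads, mapped)  # reads seen by no assembly
gain = percent_increase(178, 100)         # hypothetical BGC counts
```

With these toy inputs, `leftover` holds `r4` and `r5`, the reads that would still be searched for BGCs alongside the assemblies, and `gain` is 78.0, the form in which the abstract reports its average improvement.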
About the journal:
BMC Genomics is an open access, peer-reviewed journal that considers articles on all aspects of genome-scale analysis, functional genomics, and proteomics.
BMC Genomics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.