CoCoPyE: feature engineering for learning and prediction of genome quality indices
Niklas Birth, Nicolina Leppich, Julia Schirmacher, Nina Andreae, Rasmus Steinkamp, Matthias Blanke, Peter Meinicke
GigaScience, volume 13 (2024). DOI: https://doi.org/10.1093/gigascience/giae079
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11503480/pdf/
Abstract
Background: The exploration of the microbial world has been greatly advanced by the reconstruction of genomes from metagenomic sequence data. However, the rapidly increasing number of metagenome-assembled genomes has also resulted in wide variation in data quality. It is therefore essential to quantify the achieved completeness and possible contamination of a reconstructed genome before it is used in subsequent analyses. The classical approach to estimating these quality indices relies solely on a relatively small number of universal single-copy genes. More recent tools try to extend the genomic coverage of the estimates to increase accuracy.
Results: We developed CoCoPyE, a fast tool based on a novel two-stage feature extraction and transformation scheme. It first identifies genomic markers and then refines the marker-based estimates with a machine learning approach. In our simulation studies, CoCoPyE predicted quality indices more accurately than existing tools. The CoCoPyE web server offers an easy way to try out the tool, while the freely available Python implementation enables integration into existing genome reconstruction pipelines.
Conclusions: CoCoPyE provides a new approach to assessing the quality of genome data. It complements and improves on existing tools and may help researchers to better distinguish between low-quality draft and high-quality genome assemblies in metagenome sequencing projects.
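As an illustration of the classical marker-based idea mentioned in the Background, the following minimal Python sketch estimates completeness and contamination from counts of universal single-copy genes. It is not CoCoPyE's method or API: the function name `marker_quality`, the gene identifiers, and the simple definitions used here (completeness as the fraction of markers seen at least once, contamination as surplus marker copies divided by the size of the marker set) are assumptions chosen for illustration only.

```python
from collections import Counter
from typing import Iterable, Tuple


def marker_quality(marker_hits: Iterable[str],
                   marker_set: Iterable[str]) -> Tuple[float, float]:
    """Return (completeness, contamination) as fractions of the marker set.

    marker_hits: marker identifiers detected in the genome, one entry per copy.
    marker_set:  the expected universal single-copy markers (hypothetical here).
    """
    markers = set(marker_set)
    if not markers:
        raise ValueError("marker_set must not be empty")

    # Count how often each expected marker was detected; ignore unknown hits.
    counts = Counter(hit for hit in marker_hits if hit in markers)

    # Completeness: share of expected markers found at least once.
    found = sum(1 for m in markers if counts[m] >= 1)

    # Contamination: surplus copies of markers that should occur only once.
    extra = sum(counts[m] - 1 for m in markers if counts[m] > 1)

    return found / len(markers), extra / len(markers)


# Hypothetical example: four expected markers, one missing, one duplicated.
expected = ["rpsA", "rpsB", "rplC", "recA"]
detected = ["rpsA", "rpsB", "rpsB", "recA"]
completeness, contamination = marker_quality(detected, expected)
print(f"completeness={completeness:.2f}, contamination={contamination:.2f}")
```

Because such estimates rest on only a small set of genes, a single missing or duplicated marker shifts the result considerably, which is one way to see why the abstract describes tools that extend the genomic coverage of the estimates.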
About the journal:
GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.