Title: Binning metagenomic reads with probabilistic sequence signatures based on spaced seeds
Authors: Samuele Girotto, M. Comin, Cinzia Pizzi
Venue: 2017 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB)
Publication date: 2017-08-01
DOI: 10.1109/CIBCB.2017.8058538 (https://doi.org/10.1109/CIBCB.2017.8058538)
Citation count: 3
Abstract
The growing number of sequencing projects in medicine and environmental sciences calls for efficient approaches to analyzing very large sets of metagenomic reads. Among the challenging tasks in metagenomics, the ability to agglomerate, or “bin”, reads of the same species without reference genomes plays a crucial role in building a comprehensive description of the relative abundance and diversity of the species in a sample. We recently proposed an algorithm, MetaProb, for metagenomic read binning whose precision is currently unmatched. The competitive advantage of MetaProb depends on the use of probabilistic sequence signatures based on contiguous k-mers. In this work we explore the use of spaced seeds, rather than contiguous k-mers, to build such signatures. The experimental results show that allowing mismatches in carefully chosen predefined positions brings further benefits, both improving accuracy and reducing memory requirements. Availability: https://bitbucket.org/samu661/metaprob.
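To illustrate the difference between contiguous k-mers and spaced seeds mentioned in the abstract, here is a minimal Python sketch. It is not MetaProb's implementation: the function name `spaced_words`, the seed pattern `1101`, and the two toy reads are assumptions chosen only for illustration. A seed is written as a string in which `1` marks a position that must match and `0` marks a position where a mismatch is allowed.

```python
def spaced_words(read: str, seed: str) -> list[str]:
    """Return the word sampled at every window of `read` under `seed`.

    `seed` is a string of '1' (position is kept) and '0' (position is
    skipped, so a mismatch there is tolerated). A seed made only of '1'
    yields ordinary contiguous k-mers.
    """
    care = [i for i, c in enumerate(seed) if c == "1"]  # kept offsets
    span = len(seed)                                    # window length
    return [
        "".join(read[start + i] for i in care)
        for start in range(len(read) - span + 1)
    ]


# Two hypothetical reads that differ only at index 2 (G vs T).
read_a = "ACGTAC"
read_b = "ACTTAC"

contiguous = "1111"  # plain 4-mers
spaced = "1101"      # hypothetical seed: span 4, one don't-care position

shared_contiguous = set(spaced_words(read_a, contiguous)) & set(
    spaced_words(read_b, contiguous)
)
shared_spaced = set(spaced_words(read_a, spaced)) & set(
    spaced_words(read_b, spaced)
)

print("shared with contiguous 4-mers:", sorted(shared_contiguous))
print("shared with spaced seed 1101: ", sorted(shared_spaced))
```

In this toy case every contiguous 4-mer overlaps the differing base, so the two reads share no 4-mer, while the spaced seed still yields one common word from the window where the mismatch falls on the don't-care position. The words produced by the spaced seed are also shorter (three kept positions instead of four).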