Kssd: sequence dimensionality reduction by k-mer substring space sampling enables real-time large-scale datasets analysis.
Authors: Huiguang Yi, Yanling Lin, Chengqi Lin, Wenfei Jin
Journal: Genome Biology (Journal Article)
Publication date: 2021-03-16
DOI: 10.1186/s13059-021-02303-4 (https://doi.org/10.1186/s13059-021-02303-4)
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7962209/pdf/
Abstract
Here, we develop k-mer substring space decomposition (Kssd), a sketching technique which is significantly faster and more accurate than current sketching methods. We show that it is the only method that can be used for large-scale dataset comparisons at population resolution on simulated and real data. Using Kssd, we prioritize references for all 1,019,179 bacterial whole-genome sequencing (WGS) runs from the NCBI Sequence Read Archive and find misidentification or contamination in 6164 of these. Additionally, we analyze WGS and exome runs of samples from the 1000 Genomes Project.
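To make the general idea of sketching by sampling the k-mer space concrete, here is a minimal Python sketch. It is not the Kssd algorithm, whose details the abstract does not give. It assumes a simple hash-based rule that keeps roughly one in `keep_one_in` of all possible k-mers, and it compares the resulting sketches with the Jaccard index. The function names, the choice of hash, and the default parameters are all hypothetical and chosen only for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def in_sample(kmer, keep_one_in):
    """Decide whether a k-mer belongs to the sampled subspace.

    Assumption for illustration: a k-mer is kept when its 64-bit hash is
    divisible by keep_one_in. The rule depends only on the k-mer itself,
    so every sequence samples the same subspace.
    """
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % keep_one_in == 0


def sketch(seq, k=8, keep_one_in=4):
    """Reduce a sequence to the k-mers that fall in the sampled subspace."""
    return {km for km in kmers(seq, k) if in_sample(km, keep_one_in)}


def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Because the sampling rule is a fixed function of the k-mer, two sketches can be compared directly: `jaccard(sketch(a), sketch(b))` estimates the Jaccard index of the full k-mer sets while storing only a fraction of them.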
About the journal:
Genome Biology is a leading research journal that focuses on the study of biology and biomedicine from a genomic and post-genomic standpoint. The journal consistently publishes outstanding research across various areas within these fields.
With an impact factor of 12.3 (2022), Genome Biology is the 3rd highest-ranked research journal in the Genetics and Heredity category, according to Thomson Reuters. It is also ranked 2nd among research journals in the Biotechnology and Applied Microbiology category, and it is the top-ranking open access journal in this category.
In summary, Genome Biology sets a high standard for scientific publications in the field, showcasing cutting-edge research and earning recognition among its peers.