{"title":"Integrating sequence composition information into microbial diversity analyses with k-mer frequency counting.","authors":"Nicholas A Bokulich","doi":"10.1128/msystems.01550-24","DOIUrl":null,"url":null,"abstract":"<p><p>k-mer frequency information in biological sequences is used for a wide range of applications, including taxonomy classification, sequence similarity estimation, and supervised learning. However, in spite of its widespread utility, k-mer counting has been largely neglected for diversity estimation. This work examines the application of k-mer counting for alpha and beta diversity as well as supervised classification from microbiome marker-gene sequencing data sets (16S rRNA gene and full-length fungal internal transcribed spacer [ITS] sequences). Results demonstrate a close correspondence with phylogenetically aware diversity metrics, and advantages for using k-mer-based metrics for measuring microbial biodiversity in microbiome sequencing surveys. k-mer counting appears to be a suitable and efficient strategy for feature processing prior to diversity estimation as well as supervised learning in microbiome surveys. This allows the incorporation of subsequence-level information into diversity estimation without the computational cost of pairwise sequence alignment. k-mer counting is proposed as a complementary approach for feature processing prior to diversity estimation and supervised learning analyses, enabling large-scale reference-free profiling of microbiomes in biogeography, ecology, and biomedical data. A method for k-mer counting from marker-gene sequence data is implemented in the QIIME 2 plugin q2-kmerizer (https://github.com/bokulich-lab/q2-kmerizer).</p><p><strong>Importance: </strong>k-mers are all of the subsequences of length k that comprise a sequence. Comparing the frequency of k-mers in DNA sequences yields valuable information about the composition of these sequences and their similarity. This work demonstrates that k-mer frequencies from marker-gene sequence surveys can be used to inform diversity estimates and machine learning predictions that incorporate sequence composition information. Alpha and beta diversity estimates based on k-mer frequencies closely correspond to phylogenetically aware diversity metrics, suggesting that k-mer-based diversity estimates are useful proxy measurements especially when reliable phylogenies are not available, as is often the case for some DNA sequence targets such as for internal transcribed spacer sequences.</p>","PeriodicalId":18819,"journal":{"name":"mSystems","volume":" ","pages":"e0155024"},"PeriodicalIF":5.0000,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"mSystems","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1128/msystems.01550-24","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MICROBIOLOGY","Score":null,"Total":0}
引用次数: 0
Abstract
k-mer frequency information in biological sequences is used for a wide range of applications, including taxonomy classification, sequence similarity estimation, and supervised learning. However, in spite of its widespread utility, k-mer counting has been largely neglected for diversity estimation. This work examines the application of k-mer counting for alpha and beta diversity as well as supervised classification from microbiome marker-gene sequencing data sets (16S rRNA gene and full-length fungal internal transcribed spacer [ITS] sequences). Results demonstrate a close correspondence with phylogenetically aware diversity metrics, and advantages for using k-mer-based metrics for measuring microbial biodiversity in microbiome sequencing surveys. k-mer counting appears to be a suitable and efficient strategy for feature processing prior to diversity estimation as well as supervised learning in microbiome surveys. This allows the incorporation of subsequence-level information into diversity estimation without the computational cost of pairwise sequence alignment. k-mer counting is proposed as a complementary approach for feature processing prior to diversity estimation and supervised learning analyses, enabling large-scale reference-free profiling of microbiomes in biogeography, ecology, and biomedical data. A method for k-mer counting from marker-gene sequence data is implemented in the QIIME 2 plugin q2-kmerizer (https://github.com/bokulich-lab/q2-kmerizer).
Importance: k-mers are all of the subsequences of length k that comprise a sequence. Comparing the frequency of k-mers in DNA sequences yields valuable information about the composition of these sequences and their similarity. This work demonstrates that k-mer frequencies from marker-gene sequence surveys can be used to inform diversity estimates and machine learning predictions that incorporate sequence composition information. Alpha and beta diversity estimates based on k-mer frequencies closely correspond to phylogenetically aware diversity metrics, suggesting that k-mer-based diversity estimates are useful proxy measurements especially when reliable phylogenies are not available, as is often the case for some DNA sequence targets such as for internal transcribed spacer sequences.
mSystemsBiochemistry, Genetics and Molecular Biology-Biochemistry
CiteScore
10.50
自引率
3.10%
发文量
308
审稿时长
13 weeks
期刊介绍:
mSystems™ will publish preeminent work that stems from applying technologies for high-throughput analyses to achieve insights into the metabolic and regulatory systems at the scale of both the single cell and microbial communities. The scope of mSystems™ encompasses all important biological and biochemical findings drawn from analyses of large data sets, as well as new computational approaches for deriving these insights. mSystems™ will welcome submissions from researchers who focus on the microbiome, genomics, metagenomics, transcriptomics, metabolomics, proteomics, glycomics, bioinformatics, and computational microbiology. mSystems™ will provide streamlined decisions, while carrying on ASM''s tradition of rigorous peer review.