Integrating sequence composition information into microbial diversity analyses with k-mer frequency counting.

IF 5 2区生物学 Q1 MICROBIOLOGY mSystems Pub Date : 2025-02-20 DOI:10.1128/msystems.01550-24

Nicholas A Bokulich

{"title":"Integrating sequence composition information into microbial diversity analyses with k-mer frequency counting.","authors":"Nicholas A Bokulich","doi":"10.1128/msystems.01550-24","DOIUrl":null,"url":null,"abstract":"k-mer frequency information in biological sequences is used for a wide range of applications, including taxonomy classification, sequence similarity estimation, and supervised learning. However, in spite of its widespread utility, k-mer counting has been largely neglected for diversity estimation. This work examines the application of k-mer counting for alpha and beta diversity as well as supervised classification from microbiome marker-gene sequencing data sets (16S rRNA gene and full-length fungal internal transcribed spacer [ITS] sequences). Results demonstrate a close correspondence with phylogenetically aware diversity metrics, and advantages for using k-mer-based metrics for measuring microbial biodiversity in microbiome sequencing surveys. k-mer counting appears to be a suitable and efficient strategy for feature processing prior to diversity estimation as well as supervised learning in microbiome surveys. This allows the incorporation of subsequence-level information into diversity estimation without the computational cost of pairwise sequence alignment. k-mer counting is proposed as a complementary approach for feature processing prior to diversity estimation and supervised learning analyses, enabling large-scale reference-free profiling of microbiomes in biogeography, ecology, and biomedical data. A method for k-mer counting from marker-gene sequence data is implemented in the QIIME 2 plugin q2-kmerizer (https://github.com/bokulich-lab/q2-kmerizer).Importance: k-mers are all of the subsequences of length k that comprise a sequence. Comparing the frequency of k-mers in DNA sequences yields valuable information about the composition of these sequences and their similarity. This work demonstrates that k-mer frequencies from marker-gene sequence surveys can be used to inform diversity estimates and machine learning predictions that incorporate sequence composition information. Alpha and beta diversity estimates based on k-mer frequencies closely correspond to phylogenetically aware diversity metrics, suggesting that k-mer-based diversity estimates are useful proxy measurements especially when reliable phylogenies are not available, as is often the case for some DNA sequence targets such as for internal transcribed spacer sequences.","PeriodicalId":18819,"journal":{"name":"mSystems","volume":" ","pages":"e0155024"},"PeriodicalIF":5.0000,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"mSystems","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1128/msystems.01550-24","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MICROBIOLOGY","Score":null,"Total":0}

引用次数: 0

Abstract

k-mer frequency information in biological sequences is used for a wide range of applications, including taxonomy classification, sequence similarity estimation, and supervised learning. However, in spite of its widespread utility, k-mer counting has been largely neglected for diversity estimation. This work examines the application of k-mer counting for alpha and beta diversity as well as supervised classification from microbiome marker-gene sequencing data sets (16S rRNA gene and full-length fungal internal transcribed spacer [ITS] sequences). Results demonstrate a close correspondence with phylogenetically aware diversity metrics, and advantages for using k-mer-based metrics for measuring microbial biodiversity in microbiome sequencing surveys. k-mer counting appears to be a suitable and efficient strategy for feature processing prior to diversity estimation as well as supervised learning in microbiome surveys. This allows the incorporation of subsequence-level information into diversity estimation without the computational cost of pairwise sequence alignment. k-mer counting is proposed as a complementary approach for feature processing prior to diversity estimation and supervised learning analyses, enabling large-scale reference-free profiling of microbiomes in biogeography, ecology, and biomedical data. A method for k-mer counting from marker-gene sequence data is implemented in the QIIME 2 plugin q2-kmerizer (https://github.com/bokulich-lab/q2-kmerizer).

Importance: k-mers are all of the subsequences of length k that comprise a sequence. Comparing the frequency of k-mers in DNA sequences yields valuable information about the composition of these sequences and their similarity. This work demonstrates that k-mer frequencies from marker-gene sequence surveys can be used to inform diversity estimates and machine learning predictions that incorporate sequence composition information. Alpha and beta diversity estimates based on k-mer frequencies closely correspond to phylogenetically aware diversity metrics, suggesting that k-mer-based diversity estimates are useful proxy measurements especially when reliable phylogenies are not available, as is often the case for some DNA sequence targets such as for internal transcribed spacer sequences.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

求助全文

约1分钟内获得全文去求助

来源期刊

mSystems Biochemistry, Genetics and Molecular Biology-Biochemistry

CiteScore

10.50

自引率

3.10%

发文量

308

审稿时长

13 weeks

期刊介绍： mSystems™ will publish preeminent work that stems from applying technologies for high-throughput analyses to achieve insights into the metabolic and regulatory systems at the scale of both the single cell and microbial communities. The scope of mSystems™ encompasses all important biological and biochemical findings drawn from analyses of large data sets, as well as new computational approaches for deriving these insights. mSystems™ will welcome submissions from researchers who focus on the microbiome, genomics, metagenomics, transcriptomics, metabolomics, proteomics, glycomics, bioinformatics, and computational microbiology. mSystems™ will provide streamlined decisions, while carrying on ASM''s tradition of rigorous peer review.