Integrating sequence composition information into microbial diversity analyses with k-mer frequency counting.

IF 4.6 | Zone 2, Biology | JCR Q1, Microbiology | mSystems | Pub Date: 2025-03-18 | Epub Date: 2025-02-20 | DOI: 10.1128/msystems.01550-24
Nicholas A Bokulich
mSystems, article e0155024. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11915819/pdf/

Abstract

k-mer frequency information in biological sequences is used for a wide range of applications, including taxonomy classification, sequence similarity estimation, and supervised learning. However, in spite of its widespread utility, k-mer counting has been largely neglected for diversity estimation. This work examines the application of k-mer counting for alpha and beta diversity as well as supervised classification from microbiome marker-gene sequencing data sets (16S rRNA gene and full-length fungal internal transcribed spacer [ITS] sequences). Results demonstrate a close correspondence with phylogenetically aware diversity metrics, and advantages for using k-mer-based metrics for measuring microbial biodiversity in microbiome sequencing surveys. k-mer counting appears to be a suitable and efficient strategy for feature processing prior to diversity estimation as well as supervised learning in microbiome surveys. This allows the incorporation of subsequence-level information into diversity estimation without the computational cost of pairwise sequence alignment. k-mer counting is proposed as a complementary approach for feature processing prior to diversity estimation and supervised learning analyses, enabling large-scale reference-free profiling of microbiomes in biogeography, ecology, and biomedical data. A method for k-mer counting from marker-gene sequence data is implemented in the QIIME 2 plugin q2-kmerizer (https://github.com/bokulich-lab/q2-kmerizer).
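The feature-processing idea in the abstract can be illustrated with a minimal sketch. It is not the q2-kmerizer implementation: the helper names (`count_kmers`, `sample_kmer_profile`, `shannon`, `bray_curtis`), the toy sequences, and the choice to weight each sequence's k-mer counts by its abundance in the sample are all assumptions made here for illustration. The sketch turns per-sample sequence abundances into per-sample k-mer frequency profiles, then computes one alpha diversity measure (Shannon entropy) and one beta diversity measure (Bray-Curtis dissimilarity) on those profiles, with no alignment or phylogeny involved.

```python
import math
from collections import Counter
from typing import Mapping


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping subsequence of length k in one sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def sample_kmer_profile(abundances: Mapping[str, int], k: int) -> Counter:
    """Sum k-mer counts over a sample's sequences.

    Assumption for this sketch: each sequence's k-mer counts are
    multiplied by how often that sequence was observed in the sample.
    """
    profile = Counter()
    for seq, n_obs in abundances.items():
        for kmer, count in count_kmers(seq, k).items():
            profile[kmer] += count * n_obs
    return profile


def shannon(profile: Mapping[str, int]) -> float:
    """Shannon entropy (natural log) of a k-mer frequency profile."""
    total = sum(profile.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total)
                for c in profile.values() if c > 0)


def bray_curtis(a: Mapping[str, int], b: Mapping[str, int]) -> float:
    """Bray-Curtis dissimilarity between two k-mer frequency profiles."""
    kmers = set(a) | set(b)
    diff = sum(abs(a.get(x, 0) - b.get(x, 0)) for x in kmers)
    total = sum(a.values()) + sum(b.values())
    return diff / total if total else 0.0


# Hypothetical toy samples: sequence -> number of observations.
sample_1 = {"ACGTAC": 2, "TTTT": 1}
sample_2 = {"ACGTAC": 1, "GGGG": 3}

profile_1 = sample_kmer_profile(sample_1, k=2)
profile_2 = sample_kmer_profile(sample_2, k=2)

alpha_1 = shannon(profile_1)                  # alpha diversity of sample 1
beta_12 = bray_curtis(profile_1, profile_2)   # beta diversity between samples
```

Because the profiles are plain count tables, any non-phylogenetic diversity metric that accepts a feature table could be substituted for the two shown here.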

Importance: k-mers are all of the subsequences of length k that comprise a sequence. Comparing the frequency of k-mers in DNA sequences yields valuable information about the composition of these sequences and their similarity. This work demonstrates that k-mer frequencies from marker-gene sequence surveys can be used to inform diversity estimates and machine learning predictions that incorporate sequence composition information. Alpha and beta diversity estimates based on k-mer frequencies closely correspond to phylogenetically aware diversity metrics, suggesting that k-mer-based diversity estimates are useful proxy measurements especially when reliable phylogenies are not available, as is often the case for some DNA sequence targets such as for internal transcribed spacer sequences.
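The definition above can be made concrete with a short sketch. The function names and example sequences are hypothetical, and cosine similarity is only one of several ways k-mer frequency vectors could be compared; it is used here as an assumption, not as the paper's chosen measure. The sketch lists the k-mers of a sequence, counts them, and compares two sequences by their k-mer frequencies.

```python
import math
from collections import Counter


def kmers(seq: str, k: int) -> list:
    """Return all overlapping subsequences of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_frequencies(seq: str, k: int) -> Counter:
    """Count how often each k-mer occurs in the sequence."""
    return Counter(kmers(seq.upper(), k))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two k-mer count vectors (0 if either is empty)."""
    dot = sum(a[x] * b[x] for x in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# A sequence of length n contains n - k + 1 overlapping k-mers.
freqs = kmer_frequencies("ACGTACGT", k=4)

# Sequences sharing more subsequences score closer to 1.
similar = cosine_similarity(kmer_frequencies("ACGTACGT", 4),
                            kmer_frequencies("ACGTACGA", 4))
unrelated = cosine_similarity(kmer_frequencies("AAAAAAAA", 4),
                              kmer_frequencies("CCCCCCCC", 4))
```

The comparison needs only counting and a dot product, which is why no pairwise alignment of the two sequences is required.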

Source journal
mSystems (Biochemistry, Genetics and Molecular Biology: Biochemistry)
CiteScore: 10.50
Self-citation rate: 3.10%
Articles published: 308
Review time: 13 weeks
About the journal: mSystems™ will publish preeminent work that stems from applying technologies for high-throughput analyses to achieve insights into the metabolic and regulatory systems at the scale of both the single cell and microbial communities. The scope of mSystems™ encompasses all important biological and biochemical findings drawn from analyses of large data sets, as well as new computational approaches for deriving these insights. mSystems™ will welcome submissions from researchers who focus on the microbiome, genomics, metagenomics, transcriptomics, metabolomics, proteomics, glycomics, bioinformatics, and computational microbiology. mSystems™ will provide streamlined decisions, while carrying on ASM's tradition of rigorous peer review.