k-mer-based approaches to bridging pangenomics and population genetics

Miles D. Roberts, Olivia Davis, Emily B. Josephs, Robert J. Williamson
arXiv - QuanBio - Populations and Evolution, published 2024-09-18
DOI: arxiv-2409.11683 (https://doi.org/arxiv-2409.11683)

Abstract

Many commonly studied species now have more than one chromosome-scale genome assembly, revealing a large amount of genetic diversity previously missed by approaches that map short reads to a single reference. However, many species still lack multiple reference genomes, and correctly aligning references to build pangenomes is challenging, limiting our ability to study this missing genomic variation in population genetics. Here, we argue that $k$-mers are a crucial stepping stone to bridging the reference-focused paradigms of population genetics with the reference-free paradigms of pangenomics. We review current literature on the uses of $k$-mers for performing three core components of most population genetics analyses: identifying, measuring, and explaining patterns of genetic variation. We also demonstrate how different $k$-mer-based measures of genetic variation behave in population genetic simulations according to the choice of $k$, depth of sequencing coverage, and degree of data compression. Overall, we find that $k$-mer-based measures of genetic diversity scale consistently with pairwise nucleotide diversity ($\pi$) up to values of about $\pi = 0.025$ ($R^2 = 0.97$) for neutrally evolving populations. For populations with even more variation, using shorter $k$-mers maintains this scaling up to at least $\pi = 0.1$. Furthermore, in our simulated populations, $k$-mer dissimilarity values can be reliably approximated from counting Bloom filters, highlighting a potential avenue for decreasing the memory burden of $k$-mer-based genomic dissimilarity analyses. For future studies, there is a great opportunity to further develop methods for identifying selected loci using $k$-mers.
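
To make the ideas in the abstract concrete, the following is a minimal Python sketch of two things it mentions: computing a $k$-mer dissimilarity between two sequences, and approximating $k$-mer counts with a counting Bloom filter. The abstract does not say which dissimilarity measures the authors use, so Jaccard and Bray-Curtis are assumptions chosen here because they are common. The names `count_kmers`, `jaccard_dissimilarity`, `bray_curtis_dissimilarity`, and `CountingBloomFilter` are hypothetical and are not taken from the paper.

```python
from collections import Counter
import hashlib


def count_kmers(seq, k):
    """Count every overlapping k-mer in a sequence (forward strand only)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard_dissimilarity(a, b):
    """1 - |A & B| / |A | B|, using k-mer presence/absence only."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def bray_curtis_dissimilarity(a, b):
    """1 - 2 * (sum of shared minimum counts) / (total k-mer count)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    return 1.0 - 2.0 * shared / total


class CountingBloomFilter:
    """Fixed-size array of counters (illustrative, not the authors' code).

    Memory use depends on `size`, not on the number of distinct k-mers.
    Hash collisions can inflate an estimate but never lower it.
    """

    def __init__(self, size, num_hashes):
        self.size = size
        self.num_hashes = num_hashes  # assumed to be below 256 (one salt byte)
        self.counters = [0] * size

    def _slots(self, item):
        # Derive several slot indices by salting a single hash function.
        return [
            int.from_bytes(
                hashlib.blake2b(
                    item.encode(), digest_size=8, salt=bytes([i])
                ).digest(),
                "big",
            ) % self.size
            for i in range(self.num_hashes)
        ]

    def add(self, item):
        for slot in self._slots(item):
            self.counters[slot] += 1

    def estimate(self, item):
        # The smallest counter is the tightest upper bound on the true count.
        return min(self.counters[slot] for slot in self._slots(item))


if __name__ == "__main__":
    k = 3  # toy value; real analyses use much longer k-mers
    a = count_kmers("ACGTACGTGA", k)
    b = count_kmers("ACGTTCGTGA", k)
    print("Jaccard dissimilarity:", jaccard_dissimilarity(a, b))
    print("Bray-Curtis dissimilarity:", bray_curtis_dissimilarity(a, b))

    cbf = CountingBloomFilter(size=64, num_hashes=3)
    for kmer, n in a.items():
        for _ in range(n):
            cbf.add(kmer)
    print("Exact vs. estimated count of ACG:", a["ACG"], cbf.estimate("ACG"))
```

The sketch ignores reverse complements and sequencing errors, both of which a real analysis of short reads would need to handle. A smaller `size` saves memory but causes more collisions, so estimated counts, and any dissimilarity computed from them, drift further from the exact values.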