Miles D. Roberts, Olivia Davis, Emily B. Josephs, Robert J. Williamson
arXiv - QuanBio - Populations and Evolution, 2024-09-18, https://doi.org/arxiv-2409.11683
k-mer-based approaches to bridging pangenomics and population genetics
Many commonly studied species now have more than one chromosome-scale genome
assembly, revealing a large amount of genetic diversity previously missed by
approaches that map short reads to a single reference. However, many species
still lack multiple reference genomes, and correctly aligning references to
build pangenomes is challenging, limiting our ability to study this missing
genomic variation in population genetics. Here, we argue that $k$-mers are a
crucial stepping stone to bridging the reference-focused paradigms of
population genetics with the reference-free paradigms of pangenomics. We review
current literature on the uses of $k$-mers for performing three core components
of most population genetics analyses: identifying, measuring, and explaining
patterns of genetic variation. We also demonstrate how different $k$-mer-based
measures of genetic variation behave in population genetic simulations
according to the choice of $k$, depth of sequencing coverage, and degree of
data compression. Overall, we find that $k$-mer-based measures of genetic
diversity scale consistently with pairwise nucleotide diversity ($\pi$) up to
values of about $\pi = 0.025$ ($R^2 = 0.97$) for neutrally evolving
populations. For populations with even more variation, using shorter $k$-mers
maintains this scaling up to at least $\pi = 0.1$. Furthermore, in our
simulated populations, $k$-mer dissimilarity values can be reliably
approximated from counting Bloom filters, highlighting a potential avenue for
decreasing the memory burden of $k$-mer-based genomic dissimilarity analyses.
For future studies, there is a great opportunity to further develop methods for
identifying selected loci using $k$-mers.
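
To make the quantities named in the abstract concrete, here is a minimal Python sketch. It is not the authors' pipeline. It assumes Jaccard and Bray-Curtis as the $k$-mer dissimilarity measures (the abstract does not say which measures were used), computes $\pi$ from sequences that are already aligned and of equal length, and uses a toy counting Bloom filter built on salted BLAKE2b hashes. All function names, the class name, and the default filter size and hash count are hypothetical choices made for illustration.

```python
import hashlib
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k):
    """Count every overlapping k-mer in one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard_dissimilarity(a, b):
    """1 - (shared k-mers / all k-mers), using presence/absence only."""
    union = set(a) | set(b)
    if not union:
        return 0.0
    return 1.0 - len(set(a) & set(b)) / len(union)


def bray_curtis_dissimilarity(a, b):
    """1 - 2 * (sum of shared minimum counts) / (sum of all counts)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(a[kmer], b[kmer]) for kmer in set(a) & set(b))
    return 1.0 - 2.0 * shared / total


def nucleotide_diversity(seqs):
    """Pairwise nucleotide diversity for equal-length, pre-aligned sequences."""
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(x != y for x, y in zip(s, t)) for s, t in pairs)
    return diffs / (len(pairs) * len(seqs[0]))


class CountingBloomFilter:
    """Fixed-size counter array; estimates may overcount but never undercount."""

    def __init__(self, size, n_hashes):
        self.size = size
        self.n_hashes = n_hashes
        self.counters = [0] * size

    def _slots(self, item):
        # One salted BLAKE2b hash per hash function (an illustrative choice).
        # A set avoids incrementing the same slot twice for one item.
        slots = set()
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(item.encode(), digest_size=8,
                                     salt=seed.to_bytes(8, "little")).digest()
            slots.add(int.from_bytes(digest, "little") % self.size)
        return slots

    def add(self, item):
        for slot in self._slots(item):
            self.counters[slot] += 1

    def count(self, item):
        return min(self.counters[slot] for slot in self._slots(item))


def approximate_counts(seq, k, size=1024, n_hashes=3):
    """k-mer counts read back from a counting Bloom filter instead of a dict."""
    cbf = CountingBloomFilter(size, n_hashes)
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    for kmer in kmers:
        cbf.add(kmer)
    return Counter({kmer: cbf.count(kmer) for kmer in set(kmers)})


if __name__ == "__main__":
    s1, s2 = "ACGTACGTAC", "ACGTTCGTAC"  # toy haplotypes differing at one site
    exact1, exact2 = kmer_counts(s1, 3), kmer_counts(s2, 3)
    print("Jaccard:", jaccard_dissimilarity(exact1, exact2))
    print("Bray-Curtis:", bray_curtis_dissimilarity(exact1, exact2))
    print("pi:", nucleotide_diversity([s1, s2]))
    approx1, approx2 = approximate_counts(s1, 3), approximate_counts(s2, 3)
    print("Bray-Curtis (CBF):", bray_curtis_dissimilarity(approx1, approx2))
```

In this toy pair a single substitution changes up to $k$ overlapping $k$-mers, which is one intuition for why $k$-mer dissimilarity tracks $\pi$ at low divergence and why shorter $k$-mers saturate later. The counting Bloom filter trades exactness for a fixed memory footprint: hash collisions can only inflate a count, so the approximation error shrinks as the filter is made larger relative to the number of distinct $k$-mers.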