Guide to k-mer approaches for genomics across the tree of life
Katharine M. Jenike, Lucía Campos-Domínguez, Marilou Boddé, José Cerca, Christina N. Hodson, Michael C. Schatz, Kamil S. Jaron
arXiv - QuanBio - Genomics, 2024-04-01. DOI: arxiv-2404.01519
Abstract
The wide array of currently available genomes displays a wonderful diversity
in size, composition and structure, with many more to come thanks to several
global biodiversity genomics initiatives started in recent years. However,
sequencing genomes, even with all the recent advances, can still be
challenging for both technical reasons (e.g. small physical size, contaminated
samples, or limited access to appropriate sequencing platforms) and biological
reasons (e.g. germline-restricted DNA, variable ploidy levels, sex chromosomes,
or very large genomes). In recent years, k-mer-based techniques have become
popular for overcoming some of these challenges. They are based on the simple
process of dividing the analysed sequences (e.g. raw reads or genomes) into a
set of sub-sequences of length k, called k-mers. Despite this apparent
simplicity, k-mer-based analysis allows for a rapid and intuitive assessment of
complex sequencing datasets. Here, we provide the first comprehensive review of
the theoretical properties and practical applications of k-mers in biodiversity
genomics, serving as a reference manual for this powerful approach.
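
To make the k-mer decomposition described in the abstract concrete, here is a minimal Python sketch. It is an illustration only, not code from the paper: the function name `count_kmers` and the toy sequence are assumptions. It slides a window of length k along a sequence one base at a time and tallies each sub-sequence.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping sub-sequence of length k (hypothetical helper)."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    sequence = sequence.upper()
    # A sequence of length L yields L - k + 1 overlapping k-mers;
    # if the sequence is shorter than k, the range is empty.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Toy example: a 7-base sequence split into 3-mers.
counts = count_kmers("ATGATGC", 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

For the toy sequence, the window produces five 3-mers (ATG, TGA, GAT, ATG, TGC), so ATG is counted twice and the others once. Real datasets are far larger and are usually processed with dedicated k-mer counting tools, but the underlying operation is the same.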