The Statistics of Parametrized Syncmers in a Simple Mutation Process Without Spurious Matches
John L Spouge, Pijush Das, Ye Chen, Martin Frith
Journal of Computational Biology, published 2024-11-12
DOI: 10.1089/cmb.2024.0508 (https://doi.org/10.1089/cmb.2024.0508)
Abstract
Introduction: Often, bioinformatics uses summary sketches to analyze next-generation sequencing data, but most sketches are not well understood statistically. Under a simple mutation model, Blanca et al. analyzed complete sketches, that is, the complete set of unassembled k-mers, from two closely related sequences. The analysis extracted a point mutation parameter θ quantifying the evolutionary distance between the two sequences.
Methods: We extend the results of Blanca et al. for complete sketches to parametrized syncmer sketches with downsampling. A syncmer sketch can sample k-mers much more sparsely than a complete sketch. Consider the following simple mutation model disallowing insertions or deletions. Consider a reference sequence A (e.g., a subsequence from a reference genome), and mutate each nucleotide in it independently with probability θ to produce a mutated sequence B (corresponding to, e.g., a set of reads or draft assembly of a related genome). Then, syncmer counts alone yield an approximate Gaussian distribution for estimating θ. The assumption disallowing insertions and deletions motivates a check on the lengths of A and B. The syncmer count from B yields an approximate Gaussian distribution for its length, and a p-value can test the length of B against the length of A using syncmer counts alone.
Results: The Gaussian distributions permit syncmer counts alone to estimate θ and mutated sequence length with a known sampling error. Under some circumstances, the results provide the sampling error for the Mash containment index when applied to syncmer counts.
Conclusions: The approximate Gaussian distributions provide hypothesis tests and confidence intervals for phylogenetic distance and sequence length. Our methods are likely to generalize to sketches other than syncmers and may be useful in assembling reads and related applications.
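The mutation model described in the Methods can be simulated directly. The Python sketch below is illustrative only: it is not the estimator or the Gaussian approximation derived in the paper, neither of which the abstract spells out. It assumes closed syncmers under lexicographic s-mer order (the function names and the parameters `k`, `s`, and `offsets` are hypothetical choices), substitutions drawn uniformly from the three other nucleotides, and matches counted position by position so that spurious matches cannot arise. It relies on one consequence of the model: a k-mer survives unmutated with probability (1 - θ)^k, so the conserved fraction of the syncmers of A gives a simple point estimate of θ.

```python
import random


def is_syncmer(kmer, s, offsets):
    """Return True if the leftmost minimal s-mer of `kmer` starts at one of `offsets`.

    Assumption: s-mers are ordered lexicographically; practical tools
    typically order them by a hash instead.
    """
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    return smers.index(min(smers)) in offsets


def mutate(seq, theta, rng):
    """Substitute each nucleotide independently with probability theta.

    Assumption: a mutated base is replaced by one of the other three
    nucleotides, chosen uniformly. No insertions or deletions occur.
    """
    out = []
    for base in seq:
        if rng.random() < theta:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def estimate_theta(a, b, k, s, offsets):
    """Point estimate of theta from the conserved fraction of A's syncmers.

    Matches are checked at identical positions, so spurious matches
    between unrelated k-mers are excluded by construction.
    """
    if len(a) != len(b):
        raise ValueError("the model disallows indels: lengths must agree")
    positions = [i for i in range(len(a) - k + 1)
                 if is_syncmer(a[i:i + k], s, offsets)]
    if not positions:
        raise ValueError("no syncmers found in the reference sequence")
    conserved = sum(1 for i in positions if a[i:i + k] == b[i:i + k])
    fraction = conserved / len(positions)
    # Invert fraction = (1 - theta)^k.
    return 1.0 - fraction ** (1.0 / k)


if __name__ == "__main__":
    rng = random.Random(0)
    k, s = 15, 5
    closed = {0, k - s}  # closed syncmers: minimal s-mer at either end
    a = "".join(rng.choice("ACGT") for _ in range(20000))
    b = mutate(a, 0.02, rng)
    print("estimated theta:", estimate_theta(a, b, k, s, closed))
```

Because syncmer selection depends only on the k-mer itself, an unmutated syncmer of A is also a syncmer of B, which is why comparing substrings at syncmer positions of A suffices here. The sketch returns a point estimate only; the sampling error and confidence intervals mentioned in the Results would require the variance formulas from the paper.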
About the journal:
Journal of Computational Biology is the leading peer-reviewed journal in computational biology and bioinformatics, publishing in-depth statistical, mathematical, and computational analysis of methods, as well as their practical impact. Available only online, this is an essential journal for scientists and students who want to keep abreast of developments in bioinformatics.
Journal of Computational Biology coverage includes:
-Genomics
-Mathematical modeling and simulation
-Distributed and parallel biological computing
-Designing biological databases
-Pattern matching and pattern detection
-Linking disparate databases and data
-New tools for computational biology
-Relational and object-oriented database technology for bioinformatics
-Biological expert system design and use
-Reasoning by analogy, hypothesis formation, and testing by machine
-Management of biological databases