GraphSlimmer: Preserving Read Mappability with the Minimum Number of Variants.

IF 1.6; Zone 4 (Biology); JCR Q4 (Biochemical Research Methods); Journal of Computational Biology; Pub Date: 2024-07-01; Epub Date: 2024-07-11; DOI: 10.1089/cmb.2024.0601

Neda Tavakoli, Daniel Gibney, Srinivas Aluru

Publication type: Journal Article. Pages: 616-637.

Citations: 0

Abstract

Modern genomic datasets, such as those generated by the 1000 Genomes Project, contain millions of variants belonging to known haplotypes. Although these datasets are more representative than a single reference sequence and can alleviate issues like reference bias, they are significantly more computationally burdensome to work with, often requiring large indexed genome graph data structures for tasks such as read mapping. The construction, preprocessing, and mapping algorithms can require substantial computational resources depending on the size of these variant sets. Moreover, the accuracy of mapping algorithms has been shown to decrease when working with complete variant sets. Therefore, a drastically reduced set of variants that preserves important properties of the original set is desirable. This work provides a technique for finding a minimum-size subset of variants $S$ such that, for given parameters α and δ, all substrings of length up to α in the haplotypes are guaranteed to remain alignable to the appropriate locations with either Hamming or edit distance at most δ, using only $S$. Our contributions include showing the NP-hardness and inapproximability of these optimization problems and providing Integer Linear Programming (ILP) formulations. Our edit distance ILP formulation carefully decomposes the problem according to variant locations, which allows it to scale to all of chromosome 22's variants from the 1000 Genomes Project. Our experiments also demonstrate a significant reduction in the number of variants. For example, for moderately long reads (α = 1000), over 75% of the variants can be removed while preserving read mappability with edit distance at most one.
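
To make the optimization problem concrete, here is a minimal Python sketch of a simplified Hamming-distance version. It is not the authors' ILP formulation. It assumes every variant is a SNP at a known position and every haplotype is given as the set of variant indices it carries, so a read drawn from a haplotype incurs one mismatch for each carried SNP that was removed from the variant set. The function names (`is_feasible`, `min_variant_subset`) and the toy data are hypothetical, and the exhaustive search is only practical for a handful of variants, in line with the NP-hardness result stated above.

```python
from itertools import combinations


def is_feasible(kept, positions, haplotypes, alpha, delta):
    """Return True if every substring of length <= alpha of every haplotype
    overlaps at most delta removed SNPs (simplified Hamming-distance model)."""
    for carried in haplotypes:
        # Positions of SNPs this haplotype carries that were NOT kept.
        removed = sorted(positions[v] for v in carried if v not in kept)
        # delta + 1 removed SNPs fit in one window of length alpha
        # exactly when the first and last are fewer than alpha apart.
        for i in range(len(removed) - delta):
            if removed[i + delta] - removed[i] < alpha:
                return False
    return True


def min_variant_subset(positions, haplotypes, alpha, delta):
    """Exhaustively search for a smallest set of variant indices to keep.
    Exponential time: intended only for toy inputs."""
    variants = range(len(positions))
    for size in range(len(positions) + 1):
        for kept in combinations(variants, size):
            if is_feasible(set(kept), positions, haplotypes, alpha, delta):
                return set(kept)
    return set(variants)  # unreachable: keeping every variant is feasible


# Toy instance (made-up data): five SNPs, three haplotypes.
positions = [10, 12, 15, 40, 42]            # SNP coordinates
haplotypes = [{0, 1, 3}, {1, 2, 4}, {0, 2, 3, 4}]  # carried SNP indices

kept = min_variant_subset(positions, haplotypes, alpha=10, delta=1)
print(sorted(kept))  # [0, 1, 3]
```

In this toy instance, three of the five SNPs must be kept when δ = 1, while δ = 2 allows all of them to be dropped. A realistic instance would replace the exhaustive loop with an ILP solver, with one binary variable per variant and one constraint per haplotype window.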

Source journal

Journal of Computational Biology (Biology - Computer Science: Interdisciplinary Applications)

CiteScore: 3.60
Self-citation rate: 5.90%
Articles published: 113
Review time: 6-12 weeks

About the journal: Journal of Computational Biology is the leading peer-reviewed journal in computational biology and bioinformatics, publishing in-depth statistical, mathematical, and computational analysis of methods, as well as their practical impact. Available only online, this is an essential journal for scientists and students who want to keep abreast of developments in bioinformatics. Journal of Computational Biology coverage includes:
- Genomics
- Mathematical modeling and simulation
- Distributed and parallel biological computing
- Designing biological databases
- Pattern matching and pattern detection
- Linking disparate databases and data
- New tools for computational biology
- Relational and object-oriented database technology for bioinformatics
- Biological expert system design and use
- Reasoning by analogy, hypothesis formation, and testing by machine
- Management of biological databases