GraphSlimmer: Preserving Read Mappability with the Minimum Number of Variants.

IF 1.6 · Zone 4 (Biology) · JCR Q4 (Biochemical Research Methods) · Journal of Computational Biology · Pub Date: 2024-07-01 · Epub Date: 2024-07-11 · DOI: 10.1089/cmb.2024.0601
Neda Tavakoli, Daniel Gibney, Srinivas Aluru
Pages: 616-637. Publication type: Journal Article. Open access: no. Full text: https://doi.org/10.1089/cmb.2024.0601
Citations: 0

Abstract

Modern genomic datasets, like those generated under the 1000 Genomes Project, contain millions of variants belonging to known haplotypes. Although these datasets are more representative than a single reference sequence and can alleviate issues like reference bias, they are significantly more computationally burdensome to work with, often involving large indexed genome graph data structures for tasks such as read mapping. The construction, preprocessing, and mapping algorithms can require substantial computational resources depending on the size of these variant sets. Moreover, the accuracy of mapping algorithms has been shown to decrease when working with complete variant sets. Therefore, a drastically reduced set of variants that preserves important properties of the original set is desirable. This work provides a technique for finding a minimum-size subset of variants S such that, for given parameters α and δ, all substrings of length up to α in the haplotypes are guaranteed to still be alignable to the appropriate locations with either Hamming or edit distance at most δ, using only S. Our contributions include showing the NP-hardness and inapproximability of these optimization problems and providing Integer Linear Programming (ILP) formulations. Our edit distance ILP formulation carefully decomposes the problem according to variant locations, which allows it to scale to support all of chromosome 22's variants from the 1000 Genomes Project. Our experiments also demonstrate a significant reduction in the number of variants. For example, for moderately long reads (α = 1000), over 75% of the variants can be removed while preserving read mappability with edit distance at most one.
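
To make the Hamming-distance version of the problem concrete, the following is a toy sketch under strong simplifying assumptions: all variants are SNPs on one linear coordinate system, each haplotype is given as the set of variant positions it carries, and a removed variant costs exactly one mismatch to any read from a haplotype that carries it. The function names, the input encoding, and the exhaustive search are hypothetical illustrations; this is not the ILP formulation from the paper and it only works for a handful of variants.

```python
from itertools import combinations


def is_feasible(removed, haplotypes, alpha, delta):
    """Return True if every window of length alpha, on every haplotype,
    contains at most delta removed variants carried by that haplotype."""
    for carried in haplotypes:
        lost = sorted(p for p in carried if p in removed)
        # A window of length alpha covers positions start..start+alpha-1,
        # so it suffices to check windows that begin at a lost position.
        for i, start in enumerate(lost):
            mismatches = sum(1 for p in lost[i:] if p < start + alpha)
            if mismatches > delta:
                return False
    return True


def min_variant_subset(positions, haplotypes, alpha, delta):
    """Exhaustively find a smallest set of variants to keep (toy sizes only).

    positions  -- iterable of SNP positions (assumed encoding)
    haplotypes -- list of sets; each set holds the positions a haplotype carries
    alpha      -- maximum read (substring) length
    delta      -- maximum allowed Hamming distance
    """
    positions = sorted(positions)
    # Try to remove as many variants as possible, largest removal sets first.
    for k in range(len(positions), -1, -1):
        for removed in combinations(positions, k):
            if is_feasible(set(removed), haplotypes, alpha, delta):
                return sorted(set(positions) - set(removed))
    return positions  # not reached: removing nothing is always feasible


# Hypothetical toy instance: four SNPs, two haplotypes.
TOY_POSITIONS = [10, 20, 30, 200]
TOY_HAPLOTYPES = [{10, 20, 200}, {20, 30}]
```

On this toy instance, `min_variant_subset(TOY_POSITIONS, TOY_HAPLOTYPES, alpha=50, delta=1)` keeps only the variant at position 20: the variants at 10, 30, and 200 can be dropped because no haplotype then loses more than one variant inside any 50-base window. The search is exponential in the number of variants, which is consistent with the hardness result stated in the abstract and is why the authors turn to ILP formulations for real data.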

Source journal
Journal of Computational Biology (category: Biology - Computer Science, Interdisciplinary Applications)
CiteScore: 3.60
Self-citation rate: 5.90%
Articles published: 113
Review time: 6-12 weeks
About the journal: Journal of Computational Biology is the leading peer-reviewed journal in computational biology and bioinformatics, publishing in-depth statistical, mathematical, and computational analysis of methods, as well as their practical impact. Available only online, this is an essential journal for scientists and students who want to keep abreast of developments in bioinformatics. Journal of Computational Biology coverage includes:
- Genomics
- Mathematical modeling and simulation
- Distributed and parallel biological computing
- Designing biological databases
- Pattern matching and pattern detection
- Linking disparate databases and data
- New tools for computational biology
- Relational and object-oriented database technology for bioinformatics
- Biological expert system design and use
- Reasoning by analogy, hypothesis formation, and testing by machine
- Management of biological databases