Density and Conservation Optimization of the Generalized Masked-Minimizer Sketching Scheme.

Journal of Computational Biology (IF 1.4; Zone 4, Biology; JCR Q4, Biochemical Research Methods). Pub Date: 2024-01-01. Epub Date: 2023-11-17. DOI: 10.1089/cmb.2023.0212
Minh Hoang, Guillaume Marçais, Carl Kingsford
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10794853/pdf/
Citation count: 0

Abstract

Minimizers and syncmers are sketching methods that sample representative k-mer seeds from a long string. The minimizer scheme guarantees a well-spread k-mer sketch (high coverage) while seeking to minimize the sketch size (low density). The syncmer scheme yields sketches that are more robust to base substitutions (high conservation) on random sequences, but does not have the coverage guarantee of minimizers. These sketching metrics are generally adversarial to one another, especially in the context of sketch optimization for a specific sequence, and are thus difficult to achieve simultaneously. The parameterized syncmer scheme was recently introduced as a generalization of syncmers with more flexible sampling rules and empirically better coverage than the original syncmer variants. However, no approach exists to optimize parameterized syncmers. To address this shortcoming, we introduce a new scheme called masked minimizers that generalizes minimizers in a manner analogous to how parameterized syncmers generalize syncmers, and that allows us to extend existing optimization techniques developed for minimizers. This results in a practical algorithm to optimize the masked minimizer scheme with respect to both density and conservation. We evaluate the optimization algorithm on various benchmark genomes and show that it finds sketches that are overall more compact, well-spread, and robust to substitutions than those found by previous methods. Our implementation is released at https://github.com/Kingsford-Group/maskedminimizer. This new technique will enable more efficient and robust genomic analyses in the many settings where minimizers and syncmers are used.
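
As a rough illustration of the two baseline schemes the abstract contrasts, the sketch below samples k-mers with a plain (w, k) minimizer and with a closed syncmer, then reports density and the largest gap between sampled positions. It is not the authors' implementation (that lives in the linked repository) and it does not implement masked minimizers or any optimization. The function names, the lexicographic ordering, and the leftmost tie-breaking rule are all assumptions made for this example.

```python
def minimizer_sketch(seq, k, w):
    """Start positions of k-mers picked by a (w, k) minimizer.

    Assumption: k-mers are ranked lexicographically, ties go to the leftmost.
    """
    n_kmers = len(seq) - k + 1
    selected = set()
    for start in range(n_kmers - w + 1):
        # Each window of w consecutive k-mers contributes its smallest k-mer.
        best = min(range(start, start + w), key=lambda i: (seq[i:i + k], i))
        selected.add(best)
    return sorted(selected)


def closed_syncmer_sketch(seq, k, s):
    """Start positions of k-mers whose smallest s-mer sits at either end.

    Selection depends only on the k-mer itself, not on its neighbours.
    """
    selected = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        pos = min(range(k - s + 1), key=lambda j: (kmer[j:j + s], j))
        if pos == 0 or pos == k - s:
            selected.append(i)
    return selected


def density(positions, seq, k):
    """Fraction of all k-mers in seq that were sampled."""
    return len(positions) / (len(seq) - k + 1)


def max_gap(positions):
    """Largest distance between consecutive sampled positions."""
    return max((b - a for a, b in zip(positions, positions[1:])), default=0)


if __name__ == "__main__":
    seq = "ACGTACGTGACT"  # toy sequence, chosen only for illustration
    mins = minimizer_sketch(seq, k=3, w=4)
    syncs = closed_syncmer_sketch(seq, k=5, s=2)
    print("minimizer positions:", mins, "density:", density(mins, seq, 3))
    print("minimizer max gap:", max_gap(mins))
    print("closed syncmer positions:", syncs, "density:", density(syncs, seq, 5))
```

Because every window of w consecutive k-mers contributes at least one sampled position, the gap between consecutive minimizer positions can never exceed w, which is the coverage guarantee the abstract refers to. The syncmer rule looks only at the k-mer itself, so it offers no such bound.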

Source journal
Journal of Computational Biology (Biology; Computer Science: Interdisciplinary Applications)
CiteScore: 3.60
Self-citation rate: 5.90%
Articles published: 113
Review time: 6-12 weeks
Journal description: Journal of Computational Biology is the leading peer-reviewed journal in computational biology and bioinformatics, publishing in-depth statistical, mathematical, and computational analysis of methods, as well as their practical impact. Available only online, this is an essential journal for scientists and students who want to keep abreast of developments in bioinformatics. Journal of Computational Biology coverage includes:
-Genomics
-Mathematical modeling and simulation
-Distributed and parallel biological computing
-Designing biological databases
-Pattern matching and pattern detection
-Linking disparate databases and data
-New tools for computational biology
-Relational and object-oriented database technology for bioinformatics
-Biological expert system design and use
-Reasoning by analogy, hypothesis formation, and testing by machine
-Management of biological databases