Algorithms for Molecular Biology: latest articles

b-move: faster lossless approximate pattern matching in a run-length compressed index.
IF 1.7; Zone 4 (Biology); Q4, Biochemical Research Methods; Pub Date: 2025-08-12; DOI: 10.1186/s13015-025-00281-x
Lore Depuydt, Luca Renders, Simon Van de Vyver, Lennart Veys, Travis Gagie, Jan Fostier

Background: Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advancements in run-length compressed indices, like Gagie et al.'s r-index and Nishimoto and Tabei's move structure, alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.'s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns.

Results: We introduce b-move, a novel bidirectional extension of the move structure, enabling fast, cache-efficient, lossless approximate pattern matching in run-length compressed space. It achieves bidirectional character extensions up to 7 times faster than the br-index, closing the performance gap with FM-index-based alternatives. For locating occurrences, b-move performs ϕ and ϕ⁻¹ operations up to 7 times faster than the br-index. At the same time, it maintains the favorable memory characteristics of the br-index: for example, all available complete E. coli genomes in NCBI's RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop.

Conclusions: b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license.
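
To make "run-length compressed space" concrete, the following is a minimal sketch, not the b-move implementation: it builds a Burrows-Wheeler transform by sorting rotations and counts its runs. The function names and the quadratic construction are assumptions chosen for brevity; real indices use far more efficient construction and store additional per-run data.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (illustration only)."""
    s = text + "$"  # sentinel; assumed to sort before every other character
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def run_length_encode(s: str) -> list:
    """Collapse a string into (character, run length) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


if __name__ == "__main__":
    repetitive = "ACGT" * 8  # toy stand-in for a highly repetitive collection
    transformed = bwt(repetitive)
    runs = run_length_encode(transformed)
    print(len(transformed), "characters,", len(runs), "runs")
```

On repetitive input the number of runs stays small compared with the text length, which is the property that run-length compressed indices exploit.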

Algorithms for Molecular Biology 20(1):15. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12345024/pdf/
Citations: 0
NANUQ⁺: A divide-and-conquer approach to network estimation.
IF 1.7; Zone 4 (Biology); Q4, Biochemical Research Methods; Pub Date: 2025-07-25; DOI: 10.1186/s13015-025-00274-w
Elizabeth S Allman, Hector Baños, John A Rhodes, Kristina Wicke

Inference of a species network from genomic data remains a difficult problem, with recent progress mostly limited to the level-1 case. However, inference of the Tree of Blobs of a network, showing only the network's cut edges, can be performed for any network by TINNiK, suggesting a divide-and-conquer approach to network inference where the tree's multifurcations are individually resolved to give more detailed structure. Here we develop a method, NANUQ⁺, to quickly perform such a level-1 resolution. Viewed as part of the NANUQ pipeline for fast level-1 inference, this gives tools for both understanding when the level-1 assumption is likely to be met and for exploring all highly-supported resolutions to cycles.

Algorithms for Molecular Biology 20(1):14. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12297685/pdf/
Citations: 0
Swiftly identifying strongly unique k-mers.
IF 1.5; Zone 4 (Biology); Q4, Biochemical Research Methods; Pub Date: 2025-07-13; DOI: 10.1186/s13015-025-00286-6
Jens Zentgraf, Sven Rahmann

Motivation: Short DNA sequences of length k that appear in a single location (e.g., at a single genomic position, in a single species from a larger set of species, etc.) are called unique k-mers. They are useful for placing sequenced DNA fragments at the correct location without computing alignments and without ambiguity. However, they are not necessarily robust: a single basepair change may turn a unique k-mer into a different one that may in fact be present at one or more different locations, which may give confusing or contradictory information when attempting to place a read by its k-mer content. A more robust concept is that of strongly unique k-mers, i.e., unique k-mers for which no Hamming-distance-1 neighbor with conflicting information exists in any of the considered sequences. Given a set of k-mers, it is therefore of interest to have an efficient method that can distinguish k-mers with a Hamming-distance-1 neighbor in the collection from those without one.

Results: We present engineered algorithms to identify and mark within a set K of (canonical) k-mers all elements that have a Hamming-distance-1 neighbor in the same set. One algorithm is based on recursively running a 4-way comparison on sub-intervals of the sorted set. The other algorithm is based on bucketing and running a pairwise bit-parallel Hamming distance test on small buckets of the sorted set. Both methods consider canonical k-mers (i.e., taking reverse complements into account) and allow for efficient parallelization. The methods have been implemented and applied in practice to sets consisting of several billion k-mers. An optimized combined approach running with 16 threads on a 16-core workstation yields wall times below 20 seconds on the 2.5 billion distinct 31-mers of the human telomere-to-telomere reference genome.

Availability: An implementation can be found at https://gitlab.com/rahmannlab/strong-k-mers.
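
The pairwise bit-parallel Hamming distance test mentioned in the Results can be illustrated with a short sketch. It assumes a 2-bit encoding (A=0, C=1, G=2, T=3) and a naive quadratic loop over one small bucket; the function names are hypothetical and the code is not taken from the authors' implementation, which also covers bucketing, the recursive 4-way comparison, and parallelization.

```python
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding


def encode(kmer: str) -> int:
    """Pack a k-mer into an integer, two bits per base."""
    code = 0
    for base in kmer:
        code = (code << 2) | ENC[base]
    return code


def revcomp_code(code: int, k: int) -> int:
    """Reverse complement of an encoded k-mer (complement of b is 3 - b)."""
    rc = 0
    for _ in range(k):
        rc = (rc << 2) | (3 - (code & 3))
        code >>= 2
    return rc


def hamming(x: int, y: int, k: int) -> int:
    """Number of differing bases between two encoded k-mers."""
    diff = x ^ y
    low_bits = int("01" * k, 2)  # selects the low bit of every 2-bit base
    # A base differs if either of its two bits differs.
    mismatches = (diff | (diff >> 1)) & low_bits
    return bin(mismatches).count("1")


def mark_bucket(codes: list, k: int) -> list:
    """Mark k-mers in a small bucket that have a distance-1 neighbor,
    also checking the reverse complement of the other k-mer."""
    marked = [False] * len(codes)
    for i in range(len(codes)):
        for j in range(i + 1, len(codes)):
            x, y = codes[i], codes[j]
            if hamming(x, y, k) == 1 or hamming(x, revcomp_code(y, k), k) == 1:
                marked[i] = marked[j] = True
    return marked


if __name__ == "__main__":
    bucket = [encode(s) for s in ("AAC", "AAG", "CCC")]
    print(mark_bucket(bucket, 3))
```

The XOR-and-mask step compares all bases of two k-mers with a handful of word operations, which is what makes the pairwise test cheap on small buckets.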

Algorithms for Molecular Biology 20(1):13. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12257829/pdf/
Citations: 0
Anchorage accurately assembles anchor-flanked synthetic long reads.
IF 1.5; Zone 4 (Biology); Q4, Biochemical Research Methods; Pub Date: 2025-07-06; DOI: 10.1186/s13015-025-00288-4
Xiaofei Carl Zang, Xiang Li, Kyle Metcalfe, Tuval Ben-Yehezkel, Ryan Kelley, Mingfu Shao

Modern sequencing technologies allow for the addition of short-sequence tags, known as anchors, to both ends of a captured molecule. Anchors are useful in assembling the full-length sequence of a captured molecule as they can be used to accurately determine the endpoints. One representative of such anchor-enabled technology is LoopSeq Solo, a synthetic long read (SLR) sequencing protocol. LoopSeq Solo also achieves ultra-high sequencing depth and high purity of short reads covering the entire captured molecule. Despite the availability of many assembly methods, constructing full-length sequences from these anchor-enabled, ultra-high-coverage sequencing data remains challenging due to the complexity of the underlying assembly graphs and the lack of specific algorithms leveraging anchors. We present Anchorage, a novel assembler that performs anchor-guided assembly for ultra-high-depth sequencing data. Anchorage starts with a k-mer-based approach for precise estimation of molecule lengths. It then formulates the assembly problem as finding an optimal path that connects the two nodes determined by anchors in the underlying compact de Bruijn graph. Optimality is defined as maximizing the weight of the smallest node while matching the estimated sequence length. Anchorage uses a modified dynamic programming algorithm to efficiently find the optimal path. Through both simulations and real data, we show that Anchorage outperforms existing assembly methods, particularly in the presence of sequencing artifacts. Anchorage fills the gap in assembling anchor-enabled data. We anticipate its broad use as anchor-enabled sequencing technologies become prevalent. Anchorage is freely available at https://github.com/Shao-Group/anchorage; the scripts and documents that can reproduce all experiments in this manuscript are available at https://github.com/Shao-Group/anchorage-test.
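
The path objective described above (maximize the smallest node weight subject to a length constraint) can be sketched as a dynamic program over (node, length) states. This is an illustrative toy under stated assumptions, not Anchorage's algorithm: the graph representation, the function name, the exact-length requirement, and the simplification that path length is the plain sum of node lengths (ignoring k-mer overlaps) are all hypothetical, and the DP permits walks that revisit nodes.

```python
def best_bottleneck(nodes: dict, edges: dict, source: str, sink: str, target_len: int):
    """Largest achievable minimum node weight over source-to-sink walks whose
    total length equals target_len, or None if no such walk exists.

    nodes: name -> (weight, length), with every length >= 1
    edges: name -> list of successor names
    """
    neg_inf = float("-inf")
    # best[L][v] = best bottleneck weight of a walk ending at v with length L
    best = [dict() for _ in range(target_len + 1)]
    src_weight, src_len = nodes[source]
    if src_len > target_len:
        return None
    best[src_len][source] = src_weight
    # Lengths are positive, so states of length L are final once we reach L.
    for length in range(src_len, target_len + 1):
        for u, bottleneck in list(best[length].items()):
            for v in edges.get(u, []):
                v_weight, v_len = nodes[v]
                new_len = length + v_len
                if new_len > target_len:
                    continue
                candidate = min(bottleneck, v_weight)
                if candidate > best[new_len].get(v, neg_inf):
                    best[new_len][v] = candidate
    return best[target_len].get(sink)


if __name__ == "__main__":
    toy_nodes = {"s": (10, 3), "a": (2, 4), "b": (7, 4), "c": (9, 2), "t": (8, 3)}
    toy_edges = {"s": ["a", "b"], "a": ["t"], "b": ["t", "c"], "c": ["t"]}
    print(best_bottleneck(toy_nodes, toy_edges, "s", "t", 10))
```

In the toy graph, the walk through the weakly supported node "a" is rejected in favor of the one through "b", since both have the requested length but the latter has the larger minimum weight.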

Algorithms for Molecular Biology 20(1):12. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12232771/pdf/
Citations: 0
Faster computation of left-bounded shortest unique substrings.
IF 1.5; Zone 4 (Biology); Q4, Biochemical Research Methods; Pub Date: 2025-06-20; DOI: 10.1186/s13015-025-00287-5
Larissa L M Aguiar, Felipe A Louza

Finding shortest unique substrings (SUS) is a fundamental problem in string processing with applications in bioinformatics. In this paper, we present an algorithm for solving a variant of the SUS problem, the left-bounded shortest unique substrings (LSUS). This variant is particularly important in applications such as PCR primer design. Our algorithm runs in O(n) time using 2n memory words plus n bytes for an input string of length n. Experimental results with real and artificial datasets show that our algorithm is the fastest alternative in practice, being two times faster on average than related work, while using a similar peak memory footprint.
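
For readers unfamiliar with the LSUS problem, the sketch below computes, for each position i, the length of the shortest substring starting at i that occurs exactly once in the text. It relies on the standard observation that this length is one more than the longest common prefix shared with a lexicographically adjacent suffix. The naive suffix sorting used here is an assumption for brevity and is far from the O(n) algorithm of the paper; the function name is hypothetical.

```python
from typing import List, Optional


def lsus_lengths(text: str) -> List[Optional[int]]:
    """For each start position, length of the shortest unique substring
    beginning there, or None if every substring starting there repeats."""
    n = len(text)
    # Naive suffix array: sort start positions by their suffixes.
    suffix_array = sorted(range(n), key=lambda i: text[i:])

    def lcp(i: int, j: int) -> int:
        """Longest common prefix length of the suffixes at i and j."""
        length = 0
        while i + length < n and j + length < n and text[i + length] == text[j + length]:
            length += 1
        return length

    result: List[Optional[int]] = [None] * n
    for rank, i in enumerate(suffix_array):
        left = lcp(i, suffix_array[rank - 1]) if rank > 0 else 0
        right = lcp(i, suffix_array[rank + 1]) if rank + 1 < n else 0
        length = 1 + max(left, right)
        if i + length <= n:  # otherwise the suffix itself is repeated
            result[i] = length
    return result


if __name__ == "__main__":
    print(lsus_lengths("banana"))
```

For "banana", position 1 needs four characters ("anan") because "ana" occurs twice, while positions 3 to 5 have no unique substring because each of those suffixes also appears earlier in the text.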

Algorithms for Molecular Biology 20(1):11. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12181909/pdf/
Citations: 0
Reconstructing rearrangement phylogenies of natural genomes.
IF 1.5; Zone 4 (Biology); Q4, Biochemical Research Methods; Pub Date: 2025-06-07; DOI: 10.1186/s13015-025-00279-5
Leonard Bohnenkämper, Jens Stoye, Daniel Doerr

Background: We study the classical problem of inferring ancestral genomes from a set of extant genomes under a given phylogeny, known as the Small Parsimony Problem (SPP). Genomes are represented as sequences of oriented markers, organized in one or more linear or circular chromosomes. Any marker may appear in several copies, without restriction on orientation or genomic location, known as the natural genomes model. Evolutionary events along the branches of the phylogeny encompass large scale rearrangements, including segmental inversions, translocations, gain and loss (DCJ-indel model). Even under simpler rearrangement models, such as the classical breakpoint model without duplicates, the SPP is computationally intractable. Nevertheless, the SPP for natural genomes under the DCJ-indel model has been studied recently, with limited success.

Methods: Building on prior work, we present a highly optimized integer linear program (ILP) that is able to solve the SPP for sufficiently small phylogenies and gene families. A notable improvement over the previous result is an optimized way of handling both circular and linear chromosomes. This is especially relevant to the SPP, since the chromosomal structure of ancestral genomes is unknown and the solution space for this chromosomal structure is typically large.

Results: We benchmark our method on simulated and real data. On simulated phylogenies we observe a considerable performance improvement on problems that include linear chromosomes. Even when the ground truth contains only one circular chromosome per genome, our method outperforms its predecessor due to its optimized handling of the solution space. The practical advantage also becomes visible in an analysis of seven Anopheles taxa.

Algorithms for Molecular Biology 20(1):10. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12144824/pdf/
Citations: 0
Sama: a contig assembler with correctness guarantee.
IF 1.5; Zone 4 (Biology); Q4, Biochemical Research Methods; Pub Date: 2025-06-03; DOI: 10.1186/s13015-025-00280-y
Leena Salmela

Background: In genome assembly the task is to reconstruct a genome based on sequencing reads. Current practical methods are based on heuristics which are hard to analyse and thus such analysis is not readily available.

Results: We present a model for estimating the probability of misassembly at each position of a de Bruijn graph based assembly. Unlike previous work, our model also takes into account missing data. We apply our model to produce contigs with a correctness guarantee and correctness estimates for each position in the contigs.

Conclusions: Our experiments show that when the coverage of k-mers is high enough, our method produces contigs with contiguity characteristics similar to those of state-of-the-art assemblers, which are based on heuristic correction of the de Bruijn graph. Our model may have further applications in downstream analysis of contigs or in any analysis working directly on the de Bruijn graph.

Algorithms for Molecular Biology 20(1):9. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12135590/pdf/
Citations: 0
Estimating similarity and distance using FracMinHash.
IF 1.5; Zone 4 (Biology); Q4, Biochemical Research Methods; Pub Date: 2025-05-15; DOI: 10.1186/s13015-025-00276-8
Mahmudur Rahman Hera, David Koslicki

Motivation: The increasing number and volume of genomic and metagenomic data necessitate scalable and robust computational models for precise analysis. Sketching techniques utilizing k-mers from a biological sample have proven to be useful for large-scale analyses. In recent years, FracMinHash has emerged as a popular sketching technique and has been used in several useful applications. Recent studies on FracMinHash proved unbiased estimators for the containment and Jaccard indices. However, theoretical investigations for other metrics are still lacking.

Theoretical contributions: In this paper, we present a theoretical framework for estimating similarity/distance metrics by using FracMinHash sketches, when the metric is expressible in a certain form. We establish conditions under which such an estimation is sound and recommend a minimum scale factor s for accurate results. Experimental evidence supports our theoretical findings.

Practical contributions: We also present frac-kmc, a fast and efficient FracMinHash sketch generator program. frac-kmc is the fastest known FracMinHash sketch generator, delivering accurate and precise results for cosine similarity estimation on real data. frac-kmc is also the first parallel tool for this task, allowing for speeding up sketch generation using multiple CPU cores - an option lacking in existing serialized tools. We show that by computing FracMinHash sketches using frac-kmc, we can estimate pairwise similarity speedily and accurately on real data. frac-kmc is freely available here: https://github.com/KoslickiLab/frac-kmc/.
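
As a rough illustration of the sketching idea, the following assumes the usual FracMinHash definition: a k-mer is kept when its hash falls below a fraction s of the hash range. The hash function (BLAKE2b truncated to 64 bits), the function names, and the plain plug-in estimators are assumptions made for this sketch; it is not frac-kmc, it ignores canonical k-mers, and it omits the bias corrections and the minimum scale factor analysed in the paper.

```python
import hashlib
from collections import Counter
from math import sqrt

MAX_HASH = 2 ** 64  # size of the assumed 64-bit hash range


def hash64(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (assumed choice: BLAKE2b)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq: str, k: int, s: float) -> Counter:
    """Sketch holding hash -> abundance for k-mers hashing below s * MAX_HASH."""
    threshold = int(s * MAX_HASH)
    sketch = Counter()
    for i in range(len(seq) - k + 1):
        h = hash64(seq[i:i + k])
        if h < threshold:
            sketch[h] += 1
    return sketch


def containment(a: Counter, b: Counter) -> float:
    """Plug-in estimate of the fraction of A's k-mers that also occur in B."""
    return len(a.keys() & b.keys()) / len(a) if a else 0.0


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of the abundance vectors restricted to the sketches."""
    dot = sum(a[h] * b[h] for h in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    x = frac_minhash("ACGTACGTACGGTCA", k=4, s=0.5)
    y = frac_minhash("ACGTACGTTCGGTCA", k=4, s=0.5)
    print(containment(x, y), cosine(x, y))
```

Because every sequence is filtered with the same hash threshold, sketches of different samples remain directly comparable, and a smaller s trades accuracy for a smaller sketch.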

Citations: 0
AlfaPang: alignment free algorithm for pangenome graph construction.
IF 1.5 | Zone 4, Biology | Q4 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2025-05-15 | DOI: 10.1186/s13015-025-00277-7
Adam Cicherski, Anna Lisiecka, Norbert Dojer

The success of pangenome-based approaches to genomics analysis depends largely on the existence of efficient methods for constructing pangenome graphs that are applicable to large genome collections. In the current paper we present AlfaPang, a new pangenome graph building algorithm. AlfaPang is based on a novel alignment-free approach that makes it possible to construct pangenome graphs using significantly fewer computational resources than state-of-the-art tools. The code of AlfaPang is freely available at https://github.com/AdamCicherski/AlfaPang.

Citations: 0
McDag: indexing maximal common subsequences for k strings.
IF 1.5 | Zone 4, Biology | Q4 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2025-04-19 | DOI: 10.1186/s13015-025-00271-z
Giovanni Buzzega, Alessio Conte, Roberto Grossi, Giulia Punzi

Analyzing and comparing sequences of symbols is among the most fundamental problems in computer science, possibly even more so in bioinformatics. Maximal Common Subsequences (MCSs), i.e., inclusion-maximal sequences of non-contiguous symbols common to two or more strings, have only recently received attention in this area, despite being a basic notion and a natural generalization of more common tools like Longest Common Substrings/Subsequences. In this paper we simplify and engineer recent advancements in MCSs into a practical tool called McDag, the first publicly available tool that can index MCSs of real genomic data, and show that its definition can be generalized to multiple strings. We demonstrate that our tool can index pairs of sequences exceeding 10,000 base pairs within minutes, using only 4-7% more nodes than the minimum required. For three or more sequences, we observe experimentally that the minimum index may exhibit a significant increase in the number of nodes.
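The notion of an inclusion-maximal common subsequence can be illustrated with a brute-force check, sketched below in Python. This is not the McDag index or its algorithm; the function names are hypothetical and the approach only suits short strings. It relies on the observation that a common subsequence is maximal exactly when no single-character insertion at any position yields another common subsequence.

```python
def is_subsequence(w, s):
    """True if w can be obtained from s by deleting characters."""
    it = iter(s)
    return all(c in it for c in w)  # `in` consumes the iterator left to right


def is_common_subsequence(w, strings):
    """True if w is a subsequence of every string in strings."""
    return all(is_subsequence(w, s) for s in strings)


def is_mcs(w, strings):
    """Brute-force test: is w a maximal common subsequence of strings?

    Assumes strings is a non-empty list.
    """
    if not is_common_subsequence(w, strings):
        return False
    # Only characters present in every string can extend w.
    alphabet = set.intersection(*(set(s) for s in strings))
    for pos in range(len(w) + 1):
        for c in alphabet:
            if is_common_subsequence(w[:pos] + c + w[pos:], strings):
                return False  # w can be extended, so it is not maximal
    return True


if __name__ == "__main__":
    pair = ["GAT", "ATG"]
    for candidate in ["G", "AT", "A"]:
        print(candidate, is_mcs(candidate, pair))
```

For the pair `GAT` and `ATG`, both `G` and `AT` are maximal even though they differ in length, which shows why MCSs generalize the longest common subsequence rather than coincide with it.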

Citations: 0