
Algorithms for Molecular Biology: Latest Articles

Space-efficient computation of k-mer dictionaries for large values of k
IF 1, Zone 4 (Biology), Q4 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-04-05, DOI: 10.1186/s13015-024-00259-1
Diego Díaz-Domínguez, Miika Leinonen, Leena Salmela
Computing k-mer frequencies in a collection of reads is a common procedure in many genomic applications. Several state-of-the-art k-mer counters rely on hash tables to carry out this task, but they are often optimised for small k, because a hash table that keeps its keys (i.e., the k-mer sequences) explicitly takes $$O(N\frac{k}{w})$$ computer words, N being the number of distinct k-mers and w the computer word size, which is impractical for large values of k. This space usage is an important limitation, as the analysis of long and accurate HiFi sequencing reads can require larger values of k. We propose Kaarme, a space-efficient hash table for k-mers using $$O(N+u\frac{k}{w})$$ words of space, where u is the number of reads. Our framework exploits the fact that consecutive k-mers overlap by $$k-1$$ symbols. Thus, we only store the last symbol of a k-mer and a pointer within the hash table to a previous one, which we can use to recover the remaining $$k-1$$ symbols. We adapt Kaarme to compute canonical k-mers as well. This variant also uses pointers within the hash table to save space but requires more work to decode the k-mers. Specifically, it takes $$O(\sigma^{k})$$ time in the worst case, $$\sigma$$ being the size of the DNA alphabet, but our experiments show this is hardly ever the case. The canonical variant does not improve our theoretical results but greatly reduces space usage in practice while remaining competitive in the time needed to retrieve the k-mers and their frequencies. We compare canonical Kaarme to a regular hash table storing canonical k-mers explicitly as keys and show that our method uses up to five times less space while being less than 1.5 times slower. We also show that canonical Kaarme uses significantly less memory than state-of-the-art k-mer counters when they do not resort to disk to keep intermediate results.
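The pointer idea in this abstract can be illustrated with a small sketch. It is not Kaarme's actual data layout: the function name `count_kmers_linked`, the list-based entries, and the Python dict used as bucket storage are all assumptions made for illustration. Only the first k-mer of a read is stored explicitly; every other new k-mer keeps its last symbol plus the index of the k-mer that preceded it in the read, and keys are compared by decoding.

```python
from collections import defaultdict


def count_kmers_linked(reads, k):
    """Count k-mers while storing most of them as (last symbol, predecessor index)."""
    # Entry layouts (hypothetical, for illustration only):
    #   ["explicit", kmer, count]              first k-mer of a read, stored in full
    #   ["link", last_symbol, pred_idx, count] any other k-mer
    entries = []
    buckets = defaultdict(list)  # hash value -> indices into entries

    def decode(idx):
        """Rebuild the k-mer at entries[idx] by walking predecessor pointers."""
        tail = []  # symbols collected from the end of the k-mer backwards
        while len(tail) < k:
            entry = entries[idx]
            if entry[0] == "explicit":
                need = k - len(tail)  # always >= 1 inside this loop
                return entry[1][-need:] + "".join(reversed(tail))
            tail.append(entry[1])
            idx = entry[2]
        return "".join(reversed(tail))

    def find_or_add(kmer, pred_idx):
        h = hash(kmer)
        for idx in buckets[h]:
            if decode(idx) == kmer:  # keys are compared by decoding, not stored
                entries[idx][-1] += 1
                return idx
        if pred_idx is None:
            entries.append(["explicit", kmer, 1])
        else:
            entries.append(["link", kmer[-1], pred_idx, 1])
        buckets[h].append(len(entries) - 1)
        return len(entries) - 1

    for read in reads:
        pred = None
        for i in range(len(read) - k + 1):
            pred = find_or_add(read[i:i + k], pred)

    return {decode(i): entry[-1] for i, entry in enumerate(entries)}
```

With `reads = ["ACGTACGT", "CGTAC"]` and `k = 4`, only `ACGT` is stored in full; the other three distinct k-mers are single symbols with a pointer. Decoding follows at most k-1 pointers, which is the price paid for not storing the keys.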
Citations: 0
Infrared: a declarative tree decomposition-powered framework for bioinformatics.
IF 1.5, Zone 4 (Biology), Q4 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-03-16, DOI: 10.1186/s13015-024-00258-2
Hua-Ting Yao, Bertrand Marchand, Sarah J Berkemer, Yann Ponty, Sebastian Will

Motivation: Many bioinformatics problems can be approached as optimization or controlled sampling tasks, and solved exactly and efficiently using Dynamic Programming (DP). However, such exact methods are typically tailored towards specific settings, complex to develop, and hard to implement and adapt to problem variations.

Methods: We introduce the Infrared framework to overcome such hindrances for a large class of problems. Its underlying paradigm is tailored toward problems that can be declaratively formalized as sparse feature networks, a generalization of constraint networks. Classic Boolean constraints specify a search space, consisting of putative solutions whose evaluation is performed through a combination of features. Problems are then solved using generic cluster tree elimination algorithms over a tree decomposition of the feature network. Their overall complexities are linear in the number of variables and exponential only in the treewidth of the feature network. For sparse feature networks, which have low to moderate treewidth, these algorithms make it possible to find optimal solutions, or generate controlled samples, with practical empirical efficiency.

Results: Implementing these methods, the Infrared software allows Python programmers to rapidly develop exact optimization and sampling applications built on efficient tree decomposition-based processing. Instead of directly coding specialized algorithms, problems are declaratively modeled as sets of variables over finite domains, whose dependencies are captured by constraints and functions. Such models are then automatically solved by generic DP algorithms. To illustrate the applicability of Infrared in bioinformatics and guide new users, we model and discuss variants of bioinformatics applications. We provide reimplementations and extensions of methods for RNA design, RNA sequence-structure alignment, parsimony-driven inference of ancestral traits in phylogenetic trees/networks, and design of coding sequences. Moreover, we demonstrate multidimensional Boltzmann sampling. These applications of the framework, together with our novel results, underline the practical relevance of Infrared. Remarkably, the achieved complexities are typically equivalent to those of specialized algorithms and implementations.

Availability: Infrared is available at https://amibio.gitlabpages.inria.fr/Infrared with extensive documentation, including various usage examples and API reference; it can be installed using Conda or from source.
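To make the "declare variables, constraints and features, then solve generically" idea concrete, here is a minimal sketch of bucket (variable) elimination, a close relative of the cluster tree elimination mentioned in the abstract. It deliberately does not use the Infrared API; the function `max_by_elimination` and its argument layout are assumptions made for illustration. Constraints are encoded as features that return negative infinity when violated.

```python
from itertools import product


def max_by_elimination(domains, functions, order):
    """Maximize the sum of local functions by eliminating one variable at a time.

    domains:   dict mapping each variable to its list of values
    functions: list of (scope, f); f takes the values of scope, in scope order
    order:     elimination order covering every variable
    """
    funcs = list(functions)
    for var in order:
        bucket = [(s, f) for s, f in funcs if var in s]
        funcs = [(s, f) for s, f in funcs if var not in s]
        # The new function depends on all neighbours of var in the bucket.
        new_scope = tuple(sorted({v for s, _ in bucket for v in s} - {var}))
        table = {}
        for values in product(*(domains[v] for v in new_scope)):
            assignment = dict(zip(new_scope, values))
            best = float("-inf")
            for x in domains[var]:
                assignment[var] = x
                total = sum(f(*(assignment[v] for v in s)) for s, f in bucket)
                best = max(best, total)
            table[values] = best
        funcs.append((new_scope, lambda *vals, t=table: t[vals]))
    # Every remaining function now has an empty scope, i.e. it is a constant.
    return sum(f() for _, f in funcs)


# Toy model: five variables on a chain; neighbours must differ (constraint),
# and each variable contributes its own value to the score (feature).
domains = {i: [0, 1, 2] for i in range(5)}
functions = [((i, i + 1), lambda a, b: 1.0 if a != b else float("-inf"))
             for i in range(4)]
functions += [((i,), lambda a: float(a)) for i in range(5)]
best_score = max_by_elimination(domains, functions, order=range(5))
```

Each elimination step builds a table whose size is exponential only in the number of neighbours of the eliminated variable, which is why sparse networks with low treewidth stay tractable. On the chain above every table has at most three entries.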

Algorithms for Molecular Biology 19(1):13. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10943887/pdf/
Citations: 0
Median quartet tree search algorithms using optimal subtree prune and regraft.
IF 1, Zone 4 (Biology), Q4 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-03-13, DOI: 10.1186/s13015-024-00257-3
Shayesteh Arasti, Siavash Mirarab

Gene trees can be different from the species tree due to biological processes and inference errors. One way to obtain a species tree is to find one that maximizes some measure of similarity to a set of gene trees. The number of shared quartets between a potential species tree and gene trees provides a statistically justifiable score; if maximized properly, it could result in a statistically consistent estimator of the species tree under several statistical models of discordance. However, finding the median quartet score tree, one that maximizes this score, is NP-hard, motivating several existing heuristic algorithms. These heuristics do not follow the hill-climbing paradigm used extensively in phylogenetics. In this paper, we make theoretical contributions that enable an efficient hill-climbing approach. Specifically, we show that a subtree of size m can be placed optimally on a tree of size n in quasi-linear time with respect to n and (almost) independently of m. This result enables us to perform subtree prune and regraft (SPR) rearrangements as part of a hill-climbing search. We show that this approach can slightly improve upon the results of widely used methods such as ASTRAL in terms of the optimization score but not necessarily accuracy.
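The quartet score that this search maximizes can be computed by brute force on small trees, which may help when reading the abstract. The sketch below is a naive enumeration over all four-leaf subsets and not the quasi-linear placement algorithm of the paper. Trees are written as nested tuples of leaf names, and the function names are assumptions made for illustration.

```python
from itertools import combinations


def clades(tree):
    """Return (leaf set of tree, leaf sets below every internal node)."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found += child_found
    found.append(leaves)
    return leaves, found


def quartet_topology(quartet, clade_list):
    """Return the split {{a, b}, {c, d}} induced on the quartet, or None if unresolved."""
    q = frozenset(quartet)
    for clade in clade_list:
        side = q & clade
        if len(side) == 2:
            return frozenset([side, q - side])
    return None


def shared_quartets(tree_a, tree_b):
    """Count four-leaf subsets that both trees resolve in the same way."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    count = 0
    for quartet in combinations(sorted(leaves_a & leaves_b), 4):
        topo = quartet_topology(quartet, clades_a)
        if topo is not None and topo == quartet_topology(quartet, clades_b):
            count += 1
    return count


def quartet_score(species_tree, gene_trees):
    """Total number of quartets the species tree shares with the gene trees."""
    return sum(shared_quartets(species_tree, g) for g in gene_trees)
```

For example, `(("a", "b"), ("c", ("d", "e")))` and `(("a", "c"), ("b", ("d", "e")))` agree on the three quartets that contain both `d` and `e` and disagree on the other two. The enumeration grows with the fourth power of the number of leaves, so it is only useful as a reference for checking faster code.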

Algorithms for Molecular Biology 19(1):12. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10938725/pdf/
Citations: 0
Suffix sorting via matching statistics.
IF 1, Zone 4 (Biology), Q4 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-03-12, DOI: 10.1186/s13015-023-00245-z
Zsuzsanna Lipták, Francesco Masillo, Simon J Puglisi

We introduce a new algorithm for constructing the generalized suffix array of a collection of highly similar strings. As a first step, we construct a compressed representation of the matching statistics of the collection with respect to a reference string. We then use this data structure to distribute suffixes into a partial order, and subsequently to speed up suffix comparisons to complete the generalized suffix array. Our experimental evidence with a prototype implementation (a tool we call sacamats) shows that on string collections with highly similar strings we can construct the suffix array in time competitive with or faster than the fastest available methods. Along the way, we describe a heuristic for fast computation of the matching statistics of two strings, which may be of independent interest.
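For readers unfamiliar with the term, the matching statistics of a string with respect to a reference record, for every position, the longest prefix of the suffix starting there that also occurs in the reference, together with one place where it occurs. The sketch below computes them by direct comparison in quadratic time or worse. It is neither the heuristic of the paper nor the sacamats implementation, and the function name is an assumption.

```python
def matching_statistics(s, ref):
    """For each i, return (pos, length) of the longest prefix of s[i:] found in ref.

    pos is the leftmost start in ref of such a match, or -1 when length is 0.
    """
    result = []
    for i in range(len(s)):
        best_pos, best_len = -1, 0
        for j in range(len(ref)):
            length = 0
            while (i + length < len(s) and j + length < len(ref)
                   and s[i + length] == ref[j + length]):
                length += 1
            if length > best_len:  # strict comparison keeps the leftmost position
                best_pos, best_len = j, length
        result.append((best_pos, best_len))
    return result
```

Two suffixes whose matches land on the same reference position can be compared partly through the reference, which is the intuition behind using these values to speed up suffix comparisons on highly similar strings.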

Algorithms for Molecular Biology 19(1):11. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10935992/pdf/
Citations: 0
Finding maximal exact matches in graphs
IF 1, Zone 4 (Biology), Q4 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-03-11, DOI: 10.1186/s13015-024-00255-5
Nicola Rizzo, Manuel Cáceres, Veli Mäkinen
We study the problem of finding maximal exact matches (MEMs) between a query string Q and a labeled graph G. MEMs are an important class of seeds, often used in seed-chain-extend type of practical alignment methods because of their strong connections to classical metrics. A principled way to speed up chaining is to limit the number of MEMs by considering only MEMs of length at least $$\kappa$$ ($$\kappa$$-MEMs). However, on arbitrary input graphs, the problem of finding MEMs cannot be solved in truly sub-quadratic time under SETH (Equi et al., TALG 2023) even on acyclic graphs. In this paper we show an $$O(n\cdot L \cdot d^{L-1} + m + M_{\kappa,L})$$-time algorithm finding all $$\kappa$$-MEMs between Q and G spanning exactly L nodes in G, where n is the total length of node labels, d is the maximum degree of a node in G, $$m = |Q|$$, and $$M_{\kappa,L}$$ is the number of output MEMs. We use this algorithm to develop a $$\kappa$$-MEM finding solution on indexable Elastic Founder Graphs (Equi et al., Algorithmica 2022) running in time $$O(nH^2 + m + M_\kappa)$$, where H is the maximum number of nodes in a block, and $$M_\kappa$$ is the total number of $$\kappa$$-MEMs. Our results generalize to the analysis of multiple query strings (MEMs between G and any of the strings). Additionally, we provide some experimental results showing that the number of graph MEMs is an order of magnitude smaller than the number of string MEMs of the corresponding concatenated collection. We show that seed-chain-extend type of alignment methods can be implemented on top of indexable Elastic Founder Graphs by providing an efficient way to produce the seeds between a set of queries and the graph. The code is available at https://github.com/algbio/efg-mems .
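As a reference point for the definition, the sketch below enumerates $$\kappa$$-MEMs between two plain strings by brute force. It works on strings rather than labeled graphs, runs in at least quadratic time, and is unrelated to the efg-mems code; the function name `kappa_mems` is an assumption.

```python
def kappa_mems(query, text, kappa):
    """Return all (i, j, length) with query[i:i+length] == text[j:j+length],
    length >= kappa, such that the match cannot be extended left or right."""
    mems = []
    for i in range(len(query)):
        for j in range(len(text)):
            if query[i] != text[j]:
                continue
            # Skip starts that could be extended one symbol to the left.
            if i > 0 and j > 0 and query[i - 1] == text[j - 1]:
                continue
            length = 0
            while (i + length < len(query) and j + length < len(text)
                   and query[i + length] == text[j + length]):
                length += 1  # extend to the right as far as possible
            if length >= kappa:
                mems.append((i, j, length))
    return mems
```

With `query = "ACGTT"` and `text = "TACGA"`, a threshold of 2 leaves only the match `ACGT`-prefix `ACG` at `(0, 1, 3)`, while a threshold of 1 also reports three single-symbol matches. This is the filtering effect the abstract refers to when it restricts attention to $$\kappa$$-MEMs.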
Citations: 0
SparseRNAfolD: optimized sparse RNA pseudoknot-free folding with dangle consideration.
IF 1.5, Zone 4 (Biology), Q4 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-03-03, DOI: 10.1186/s13015-024-00256-4
Mateo Gray, Sebastian Will, Hosna Jabbari

Motivation: Computational RNA secondary structure prediction by free energy minimization is indispensable for analyzing structural RNAs and their interactions. These methods find the structure with the minimum free energy (MFE) among exponentially many possible structures and have a restrictive time and space complexity ($$O(n^3)$$ time and $$O(n^2)$$ space for pseudoknot-free structures) for longer RNA sequences. Furthermore, accurate free energy calculations, including dangle contributions can be difficult and costly to implement, particularly when optimizing for time and space requirements.

Results: Here we introduce a fast and efficient sparsified MFE pseudoknot-free structure prediction algorithm, SparseRNAFolD, that utilizes an accurate energy model that accounts for dangle contributions. While the sparsification technique was previously employed to improve the time and space complexity of a pseudoknot-free structure prediction method with a realistic energy model, SparseMFEFold, it was not extended to include dangle contributions due to the complexity of computation. This may come at the cost of prediction accuracy. In this work, we compare three different sparsified implementations for dangle contributions and provide pros and cons of each method. We also compare our algorithm to LinearFold, a linear time and space algorithm, and find that in practice SparseRNAFolD has lower memory consumption across all sequence lengths and is faster for lengths up to 1000 bases.

Conclusion: Our SparseRNAFolD algorithm is an MFE-based algorithm that guarantees optimality of result and employs the most general energy model, including dangle contributions. We provide a basis for applying dangles to sparsified recursion in a pseudoknot-free model that has the potential to be extended to pseudoknots.
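The cubic time and quadratic space quoted in the motivation come from the shape of the underlying recursion, which the following sketch makes visible. It swaps in the much simpler Nussinov-style base-pair maximization for free energy minimization, so it has no energy model, no dangles and no sparsification, and it is not the SparseRNAFolD algorithm. The function name and the `min_loop` parameter are assumptions made for illustration.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested (pseudoknot-free) base pairs in an RNA sequence.

    min_loop is the minimum number of unpaired bases enclosed by any pair.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best score on seq[i..j]; this table is the O(n^2) space.
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j stays unpaired
            # case 2: j pairs with some k; trying every k gives the O(n^3) time.
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]
```

Sparsification, as described in the abstract, targets the inner loop over `k` and the full table: many split points and table entries can be shown never to contribute to an optimum and need not be examined or stored.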

Algorithms for Molecular Biology 19(1):9. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11289965/pdf/
Citations: 0
Recombinations, chains and caps: resolving problems with the DCJ-indel model.
IF 1, Zone 4 (Biology), Q4 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-02-27, DOI: 10.1186/s13015-024-00253-7
Leonard Bohnenkämper

One of the most fundamental problems in genome rearrangement studies is the (genomic) distance problem. It is typically formulated as finding the minimum number of rearrangements under a model that are needed to transform one genome into the other. A powerful multi-chromosomal model is the Double Cut and Join (DCJ) model. While the DCJ model is not able to deal with some situations that occur in practice, like duplicated or lost regions, it was extended over time to handle these cases. First, it was extended to the DCJ-indel model, solving the issue of lost markers. Later, ILP solutions for so-called natural genomes, in which each genomic region may occur an arbitrary number of times, were developed, making it possible in theory to solve the distance problem for any pair of genomes. However, some theoretical and practical issues remained unsolved. On the theoretical side, there exist two disparate views of the DCJ-indel model, motivated in the same way, but with different conceptualizations that could not be reconciled so far. On the practical side, while ILP solutions for natural genomes typically perform well on telomere-to-telomere resolved genomes, they have been shown in recent years to quickly lose performance on genomes with a large number of contigs or linear chromosomes. This has been linked to a particular technique, namely capping. Simply put, capping circularizes linear chromosomes by concatenating them during solving time, increasing the solution space of the ILP superexponentially. Recently, we introduced a new conceptualization of the DCJ-indel model within the context of another rearrangement problem. In this manuscript, we apply this new conceptualization to the distance problem. In doing this, we uncover the relation between the disparate conceptualizations of the DCJ-indel model. We are also able to derive an ILP solution to the distance problem that does not rely on capping. This solution significantly improves upon the performance of previous solutions on genomes with high numbers of contigs while still solving the problem exactly and being competitive in performance otherwise. We demonstrate the performance advantage on simulated genomes and show its practical usefulness in an analysis of 11 Drosophila genomes.
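To make the distance problem concrete, the sketch below computes the classic DCJ distance for the simplest setting only: two genomes with identical gene content, each gene occurring exactly once. It is not the DCJ-indel model or the capping-free ILP of this paper. It uses the well-known closed form d = n - (c + i/2), where n is the number of genes, c the number of cycles and i the number of odd paths in the adjacency graph. The genome encoding (a list of `(signed_genes, is_circular)` pairs) and the function names are assumptions made for illustration.

```python
def extremity_pairs(genome):
    """Map every gene extremity to its neighbour (None at a telomere).

    genome: list of (genes, circular) pairs; genes is a list of signed ints.
    Assumes every gene occurs exactly once in the genome.
    """
    partner = {}
    for genes, circular in genome:
        # Each signed gene contributes a (left, right) pair of extremities:
        # tail then head when read forward, head then tail when reversed.
        ends = [((abs(g), "t"), (abs(g), "h")) if g > 0
                else ((abs(g), "h"), (abs(g), "t")) for g in genes]
        for (_, right), (left, _) in zip(ends, ends[1:]):
            partner[right], partner[left] = left, right
        first, last = ends[0][0], ends[-1][1]
        if circular:
            partner[last], partner[first] = first, last
        else:
            partner[first] = partner[last] = None
    return partner


def dcj_distance(genome_a, genome_b):
    """Classic DCJ distance n - (c + i/2) for equal, duplicate-free gene content."""
    adj_a, adj_b = extremity_pairs(genome_a), extremity_pairs(genome_b)
    if adj_a.keys() != adj_b.keys():
        raise ValueError("genomes must contain the same genes")
    n_genes = len(adj_a) // 2
    seen, cycles, odd_paths = set(), 0, 0
    for start in adj_a:
        if start in seen:
            continue
        # Collect one connected component, following adjacencies of A and B.
        component, stack = [], [start]
        seen.add(start)
        while stack:
            e = stack.pop()
            component.append(e)
            for nxt in (adj_a[e], adj_b[e]):
                if nxt is not None and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        # No telomere in the component means it is a cycle; otherwise a path.
        if all(adj_a[e] is not None and adj_b[e] is not None for e in component):
            cycles += 1
        elif len(component) % 2 == 1:
            odd_paths += 1
    return n_genes - (cycles + odd_paths // 2)
```

For example, `dcj_distance([([1, -2, 3], True)], [([1, 2, 3], True)])` measures the distance between a circular chromosome with one inverted gene and its unrearranged counterpart. Duplicated or missing markers, which motivate the DCJ-indel and natural-genome extensions described above, are outside the scope of this sketch.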

Unifying duplication episode clustering and gene-species mapping inference.
IF 1 Zone 4 Biology Q4 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-02-14 DOI: 10.1186/s13015-024-00252-8
Paweł Górecki, Natalia Rutecka, Agnieszka Mykowiecka, Jarosław Paszek

We present a novel problem, called MetaEC, which aims to infer gene-species assignments in a collection of partially leaf-labeled gene trees by minimizing the size of duplication episode clustering (EC). This problem is particularly relevant in metagenomics, where incomplete data often poses a challenge in the accurate reconstruction of gene histories. To solve MetaEC, we propose a polynomial-time dynamic programming (DP) formulation that verifies the existence of a set of duplication episodes from a predefined set of episode candidates. In addition, we design a method to infer distributions of gene-species mappings. We then demonstrate how to use DP to design an algorithm that solves MetaEC. Although the algorithm is exponential in the worst case, we introduce a heuristic modification of the algorithm that provides a solution with the knowledge that it is exact. To evaluate our method, we perform two computational experiments on simulated and empirical data containing whole genome duplication events, showing that our algorithm is able to accurately infer the corresponding events.

Predicting horizontal gene transfers with perfect transfer networks.
IF 1 Zone 4 Biology Q4 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-02-06 DOI: 10.1186/s13015-023-00242-2
Alitzel López Sánchez, Manuel Lafond

Background: Horizontal gene transfer inference approaches are usually based on gene sequences: parametric methods search for patterns that deviate from a particular genomic signature, while phylogenetic methods use sequences to reconstruct the gene and species trees. However, it is well known that sequence-based methods have difficulty identifying ancient transfers, since mutations have enough time to erase all evidence of such events. In this work, we ask whether character-based methods can predict gene transfers. Their advantage over sequences is that homologous genes can have low DNA similarity but still retain enough important common motifs to share character traits, for instance the same functional or expression profile. A phylogeny with two separate clades that acquired the same character independently might indicate the presence of a transfer even in the absence of sequence similarity.

Our contributions: We introduce perfect transfer networks, which are phylogenetic networks that can explain the character diversity of a set of taxa under the assumption that characters have unique births, and that once a character is gained it is rarely lost. Examples of such traits include transposable elements, biochemical markers and the emergence of organelles, to name a few. We study the differences between our model and two similar models: perfect phylogenetic networks and ancestral recombination networks. Our goal is to initiate a study of the structural and algorithmic properties of perfect transfer networks. We then show that in polynomial time, one can decide whether a given network is a valid explanation for a set of taxa, and show how, for a given tree, one can add transfer edges to it so that it explains a set of taxa. We finally provide lower and upper bounds on the number of transfers required to explain a set of taxa, in the worst case.
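As a point of reference for the model above, the sketch below covers only the tree-only special case, not the network algorithms of the paper. Under the simplifying assumptions that every character is born exactly once and is never lost, the classic perfect phylogeny condition says that some tree explains the data exactly when the taxon sets of the characters are pairwise nested or disjoint. Pairs that violate this condition are the ones a tree alone cannot explain. The input format and function names are hypothetical.

```python
from itertools import combinations


def _compatible(a, b):
    """Two taxon sets fit on one tree if they are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a


def tree_alone_suffices(character_taxa):
    """character_taxa: dict mapping each character to the set of taxa that have it.

    Returns True when all pairs of characters are compatible, i.e. some tree
    explains every character with a single birth and no loss.
    """
    return all(_compatible(a, b)
               for a, b in combinations(character_taxa.values(), 2))


def conflicting_pairs(character_taxa):
    """List the character pairs that no single tree can explain together."""
    return [(c1, c2)
            for (c1, a), (c2, b) in combinations(character_taxa.items(), 2)
            if not _compatible(a, b)]
```

With `{"a": {"x", "y"}, "b": {"y", "z"}}`, the two characters overlap on taxon `y` without being nested, so `conflicting_pairs` reports `("a", "b")`. In the setting of the abstract, such a conflict is the kind of signal that may call for a transfer edge.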

Global exact optimisations for chloroplast structural haplotype scaffolding.
IF 1.5 Zone 4 Biology Q4 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-02-06 DOI: 10.1186/s13015-023-00243-1
Victor Epain, Rumen Andonov

Background: Scaffolding is an intermediate stage of fragment assembly. It consists in orienting and ordering the contigs obtained by the assembly of the sequencing reads. In the general case, the problem has been largely studied using distance data between the contigs. Here we focus on a dedicated scaffolding for chloroplast genomes. As these genomes are small, circular and have few specific repeats, numerous approaches have been proposed to assemble them. However, their specificities have not been sufficiently exploited.

Results: We give a new formulation of scaffolding for chloroplast genomes as a discrete optimisation problem, whose decision version we prove to be NP-complete. We take advantage of the knowledge of chloroplast genomes and succeed in expressing the relationships between a few specific genomic repeats as mathematical constraints. Our approach is independent of the distances and adopts a genomic-regions view, with priority given to scaffolding the repeats first. In this way, we encode the structural haplotype issue in order to retrieve several genome forms that coexist in the same chloroplast cell. To solve the optimisation problem exactly, we develop an integer linear program that we implement in the Python3 package khloraascaf. We test it on synthetic data to investigate its performance and its robustness against several chosen difficulties.

Conclusions: We succeed in modelling biological knowledge of genomic structures to scaffold chloroplast genomes. Our results suggest that modelling genomic regions is sufficient for scaffolding repeats and is suitable for finding several solutions corresponding to several genome forms.
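The notions of orienting and ordering contigs, and of several genome forms, can be illustrated with a small sketch. It does not use or reproduce the khloraascaf API or its integer linear program. It only assumes a scaffold is a circular list of `(region, forward)` pairs, and that two scaffolds describe the same molecule when they differ only by rotation or by reading the other strand. The region names `LSC`, `IR` and `SSC` follow the usual chloroplast naming and are used here purely as example labels.

```python
def reverse_scaffold(scaffold):
    """Read a scaffold from the other strand: reverse order, flip orientations."""
    return [(name, not forward) for name, forward in reversed(scaffold)]


def canonical_circular(scaffold):
    """Smallest representation among all rotations of both strands.

    Two circular scaffolds describe the same molecule exactly when their
    canonical forms are equal.
    """
    candidates = []
    for s in (list(scaffold), reverse_scaffold(scaffold)):
        for i in range(len(s)):
            candidates.append(tuple(s[i:] + s[:i]))
    return min(candidates)


# Two candidate forms that differ only in the orientation of the SSC region
# between the two copies of an inverted repeat (illustrative data).
form_1 = [("LSC", True), ("IR", True), ("SSC", True), ("IR", False)]
form_2 = [("LSC", True), ("IR", True), ("SSC", False), ("IR", False)]
```

Comparing `canonical_circular(form_1)` with `canonical_circular(form_2)` shows that the two forms are distinct molecules rather than two readings of the same circle, which is why a scaffolder aiming at structural haplotypes may need to report more than one solution.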
