
Algorithms for Molecular Biology: Latest Articles

Co-linear chaining on pangenome graphs.
IF 1.5 | Zone 4, Biology | Q4 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-01-27 | DOI: 10.1186/s13015-024-00250-w
Jyotshna Rajput, Ghanshyam Chandra, Chirag Jain

Pangenome reference graphs are useful in genomics because they compactly represent the genetic diversity within a species, a capability that linear references lack. However, efficiently aligning sequences to these graphs, which have complex topology and cycles, can be challenging. Seed-chain-extend alignment algorithms use co-linear chaining as a standard technique to identify a good cluster of exact seed matches that can be combined to form an alignment. Recent works show how the co-linear chaining problem can be solved efficiently for acyclic pangenome graphs by exploiting their small width, and how incorporating gap cost in the scoring function improves alignment accuracy. However, it remains open how to generalize these techniques effectively to general pangenome graphs that contain cycles. Here we present the first practical formulation and an exact algorithm for co-linear chaining on cyclic pangenome graphs. We rigorously prove the correctness and computational complexity of the proposed algorithm. We evaluate the empirical performance of our algorithm by aligning simulated long reads from the human genome to a cyclic pangenome graph constructed from 95 publicly available haplotype-resolved human genome assemblies. While existing heuristic-based algorithms are faster, the proposed algorithm provides a significant advantage in accuracy. The implementation is available at https://github.com/at-cg/PanAligner.
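
To make the chaining step concrete, here is a minimal sketch of co-linear chaining with a gap cost on a plain linear reference. It is not the graph algorithm of the paper and not PanAligner's code: the anchor layout `(query_start, ref_start, length)`, the function name `chain_anchors`, and the `gap_weight` parameter are all assumptions made for illustration, and the quadratic-time loop is the simplest textbook variant.

```python
from typing import List, Tuple

# Hypothetical anchor layout: (query_start, ref_start, length) of an exact match.
Anchor = Tuple[int, int, int]


def chain_anchors(anchors: List[Anchor], gap_weight: float = 0.5) -> Tuple[float, List[Anchor]]:
    """Quadratic-time co-linear chaining on a linear reference (illustrative only)."""
    if not anchors:
        return 0.0, []
    anchors = sorted(anchors)  # by query start, then reference start
    n = len(anchors)
    score = [float(a[2]) for a in anchors]  # best chain score ending at anchor i
    prev = [-1] * n                         # back-pointer for chain recovery
    for i, (qi, ri, li) in enumerate(anchors):
        for j in range(i):
            qj, rj, lj = anchors[j]
            # Anchor j must end before anchor i starts on both sequences.
            if qj + lj <= qi and rj + lj <= ri:
                # Gap cost: difference between the query gap and the reference gap.
                gap = abs((qi - (qj + lj)) - (ri - (rj + lj)))
                candidate = score[j] + li - gap_weight * gap
                if candidate > score[i]:
                    score[i] = candidate
                    prev[i] = j
    best = max(range(n), key=score.__getitem__)
    chain: List[Anchor] = []
    k = best
    while k != -1:
        chain.append(anchors[k])
        k = prev[k]
    chain.reverse()
    return score[best], chain


if __name__ == "__main__":
    demo = [(0, 0, 5), (10, 10, 5), (6, 100, 4), (20, 21, 5)]
    print(chain_anchors(demo))
```

In the demo, the anchor at reference position 100 is far off the diagonal, so the gap penalty keeps it out of the best chain.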

Published in Algorithms for Molecular Biology 19(1):4. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11288099/pdf/
Citations: 0
Fulgor: a fast and compact k-mer index for large-scale matching and color queries.
IF 1.5 | Zone 4, Biology | Q4 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-01-22 | DOI: 10.1186/s13015-024-00251-9
Jason Fan, Jamshed Khan, Noor Pratap Singh, Giulio Ermanno Pibiri, Rob Patro

The problem of sequence identification or matching (determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence) is relevant for many important tasks in Computational Biology, such as metagenomics and pangenome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe an efficient colored de Bruijn graph index, arising as the combination of a k-mer dictionary with a compressed inverted index. The proposed index takes full advantage of the fact that unitigs in the colored compacted de Bruijn graph are monochromatic (i.e., all k-mers in a unitig have the same set of references of origin, or color). Specifically, the unitigs are kept in the dictionary in color order, thereby allowing for the encoding of the map from k-mers to their colors in as little as 1 + o(1) bits per unitig. Hence, one color per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for integer lists, the index achieves very small space. We implement these methods in a tool called Fulgor, and conduct an extensive experimental analysis to demonstrate the improvement of our tool over previous solutions. For example, compared to Themisto (the strongest competitor in terms of the index space vs. query time trade-off), Fulgor requires significantly less space (up to 43% less space for a collection of 150,000 Salmonella enterica genomes), is at least twice as fast for color queries, and is 2-6$\times$ faster to construct.
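
The "color order" idea can be illustrated with a small sketch: once unitigs are sorted so that equal colors are contiguous, one bit per unitig marking where a new color begins is enough, and a rank query on that bit vector recovers the color. The class name `UnitigColorMap` and its methods are hypothetical and are not Fulgor's API; a plain prefix-sum list stands in for the succinct rank structure that would give the o(1) overhead in a real index.

```python
from itertools import accumulate
from typing import List


class UnitigColorMap:
    """One boundary bit per unitig plus rank (illustrative, not Fulgor's code)."""

    def __init__(self, color_of_unitig: List[int]) -> None:
        # Assumption: input is already in color order, i.e. equal colors are contiguous.
        self.boundary = [
            1 if i == 0 or c != color_of_unitig[i - 1] else 0
            for i, c in enumerate(color_of_unitig)
        ]
        # Prefix sums stand in for a succinct rank data structure.
        self._rank = list(accumulate(self.boundary))

    def color_id(self, unitig_id: int) -> int:
        """Return the dense color id (0, 1, 2, ... in color order) of a unitig."""
        return self._rank[unitig_id] - 1


if __name__ == "__main__":
    cmap = UnitigColorMap([0, 0, 0, 1, 1, 2, 2, 2, 2])
    print(cmap.boundary)
    print([cmap.color_id(u) for u in range(9)])
```

Because every k-mer in a unitig shares the unitig's color, looking up a k-mer's color reduces to finding its unitig id in the dictionary and then calling `color_id`.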

Published in Algorithms for Molecular Biology 19(1):3. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10810250/pdf/
Citations: 0
Dollo-CDP: a polynomial-time algorithm for the clade-constrained large Dollo parsimony problem.
IF 1 | Zone 4, Biology | Q4 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-01-08 | DOI: 10.1186/s13015-023-00249-9
Junyan Dai, Tobias Rubel, Yunheng Han, Erin K Molloy

The last decade of phylogenetics has seen the development of many methods that leverage constraints plus dynamic programming. The goal of this algorithmic technique is to produce a phylogeny that is optimal with respect to some objective function and that lies within a constrained version of tree space. The popular species tree estimation method ASTRAL, for example, returns a tree that (1) maximizes the quartet score computed with respect to the input gene trees and that (2) draws its branches (bipartitions) from the input constraint set. This technique has yet to be used for parsimony problems where the input consists of binary characters, sometimes with missing values. Here, we introduce the clade-constrained character parsimony problem and present an algorithm that solves this problem for the Dollo criterion score in [Formula: see text] time, where n is the number of leaves, k is the number of characters, and [Formula: see text] is the set of clades used as constraints. Dollo parsimony, which requires traits/mutations to be gained at most once but allows them to be lost any number of times, is widely used for tumor phylogenetics as well as species phylogenetics, for example in analyses of low-homoplasy retroelement insertions across the vertebrate tree of life. This motivated us to implement our algorithm in a software package, called Dollo-CDP, and evaluate its utility for analyzing retroelement insertion presence/absence patterns for bats, birds, and toothed whales, as well as simulated data. Our results show that Dollo-CDP can improve upon heuristic search from a single starting tree, often recovering a better-scoring tree. Moreover, Dollo-CDP scales to data sets with much larger numbers of taxa than branch-and-bound while still having an optimality guarantee, albeit a more restricted one. Lastly, we show that our algorithm for Dollo parsimony can easily be adapted to Camin-Sokal parsimony but not Fitch parsimony.
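
As a small illustration of the Dollo criterion itself (not of the Dollo-CDP search), the sketch below counts the losses one binary character needs on a fixed rooted tree when the trait may be gained only once. The tree encoding (a `children` dictionary), the function name `dollo_losses`, and the choice to ignore missing values are assumptions made for this example.

```python
from typing import Dict, List


def dollo_losses(children: Dict[str, List[str]], root: str, state: Dict[str, int]) -> int:
    """Number of losses for one binary character on a fixed rooted tree under Dollo.

    `children` maps each internal node to its child list; `state` gives 0/1 for
    every leaf. Missing values are not handled in this sketch.
    """
    has_one: Dict[str, bool] = {}

    def fill(v: str) -> bool:
        kids = children.get(v, [])
        if not kids:
            has_one[v] = state[v] == 1
        else:
            # Build the list first so every child is visited.
            has_one[v] = any([fill(c) for c in kids])
        return has_one[v]

    fill(root)
    if not has_one[root]:
        return 0  # trait never present: no gain, no loss

    # The single gain sits just above the lowest common ancestor of all 1-leaves.
    v = root
    while True:
        kids_with_one = [c for c in children.get(v, []) if has_one[c]]
        if len(kids_with_one) == 1:
            v = kids_with_one[0]
        else:
            break

    # Below that ancestor, each edge into a subtree without any 1-leaf is one loss.
    losses = 0
    stack = [v]
    while stack:
        u = stack.pop()
        for c in children.get(u, []):
            if has_one[c]:
                stack.append(c)
            else:
                losses += 1
    return losses


if __name__ == "__main__":
    tree = {"r": ["x", "y"], "x": ["a", "b"], "y": ["c", "z"], "z": ["d", "e"]}
    print(dollo_losses(tree, "r", {"a": 1, "b": 0, "c": 0, "d": 1, "e": 1}))
```

Summing this count over all characters gives a Dollo score for a candidate tree; the hard part addressed in the abstract is searching over trees, which this sketch does not attempt.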

Published in Algorithms for Molecular Biology 19(1):2. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10775561/pdf/
Citations: 0
Investigating the complexity of the double distance problems
IF 1 | Zone 4, Biology | Q4 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-01-04 | DOI: 10.1186/s13015-023-00246-y
Marília D. V. Braga, Leonie R. Brockmann, Katharina Klerx, Jens Stoye
Two genomes $\mathbb{A}$ and $\mathbb{B}$ over the same set of gene families form a canonical pair when each of them has exactly one gene from each family. Denote by $n_*$ the number of common families of $\mathbb{A}$ and $\mathbb{B}$. Different distances of canonical genomes can be derived from a structure called the breakpoint graph, which represents the relation between the two given genomes as a collection of cycles of even length and paths. Let $c_i$ and $p_j$ be respectively the numbers of cycles of length $i$ and of paths of length $j$ in the breakpoint graph of genomes $\mathbb{A}$ and $\mathbb{B}$. Then, the breakpoint distance of $\mathbb{A}$ and $\mathbb{B}$ is equal to $n_* - \left(c_2 + \frac{p_0}{2}\right)$. Similarly, when the considered rearrangements are those modeled by the double-cut-and-join (DCJ) operation, the rearrangement distance of $\mathbb{A}$ and $\mathbb{B}$ is $n_* - \left(c + \frac{p_e}{2}\right)$, where $c$ is the total number of cycles and $p_e$ is the total number of paths of even length. The distance formulation is a basic unit for several other combinatorial problems related to genome evolution and ancestral reconstruction, such as median or double distance. Interestingly, both median and double distance problems can be solved in polynomial time for the breakpoint distance, while they are NP-hard for the rearrangement distance. One way of exploring the complexity space between these two extremes is to consider a $\sigma_k$ distance, defined to be $n_* - \left(c_2 + c_4 + \ldots + c_k + \frac{p_0 + p_2 + \ldots + p_{k-2}}{2}\right)$, and progressively investigate the complexities of median and double distance for the $\sigma_4$ distance, then the $\sigma_6$ distance, and so on. For the median, much effort in our group and in other research groups has yielded no progress even for the $\sigma_4$ distance. For the double distance under the $\sigma_4$ and $\sigma_6$ distances, however, we could devise linear-time algorithms, which we present here.
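
The distance formulas above are simple enough to evaluate directly once the cycle and path counts of a breakpoint graph are known. The sketch below does only that arithmetic; building the breakpoint graph is out of scope. The function name `sigma_k_distance`, the dictionary layout of the counts, and the numbers in the demo are made up for illustration and do not come from real genomes.

```python
from fractions import Fraction
from typing import Dict


def sigma_k_distance(n_common: int, cycles: Dict[int, int], paths: Dict[int, int], k: int) -> Fraction:
    """Evaluate n_* - (c_2 + c_4 + ... + c_k + (p_0 + p_2 + ... + p_{k-2}) / 2).

    `cycles[i]` is the number of cycles of length i and `paths[j]` the number of
    paths of length j in the breakpoint graph (hypothetical input layout).
    """
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even integer >= 2")
    cycle_sum = sum(cycles.get(i, 0) for i in range(2, k + 1, 2))
    path_sum = sum(paths.get(j, 0) for j in range(0, k - 1, 2))
    return n_common - (cycle_sum + Fraction(path_sum, 2))


if __name__ == "__main__":
    cycles = {2: 3, 4: 1, 6: 2}   # illustrative counts only
    paths = {0: 2, 1: 1, 2: 4}
    for k in (2, 4, 6):
        print(k, sigma_k_distance(10, cycles, paths, k))
```

With $k = 2$ the function reduces to the breakpoint distance, and as $k$ grows it counts more of the cycles and even-length paths, moving toward the DCJ formula.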
Published in Algorithms for Molecular Biology 19(1).
Citations: 0
EMMA: a new method for computing multiple sequence alignments given a constraint subset alignment
IF 1 | Zone 4, Biology | Q4 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2023-12-07 | DOI: 10.1186/s13015-023-00247-x
Chengze Shen, Baqiao Liu, Kelly P. Williams, Tandy Warnow
Adding sequences into an existing (possibly user-provided) alignment has multiple applications, including updating a large alignment with new data, adding sequences into a constraint alignment constructed using biological knowledge, or computing alignments in the presence of sequence length heterogeneity. Although this is a natural problem, only a few tools have been developed to use this information with high fidelity. We present EMMA (Extending Multiple alignments using MAFFT--add) for the problem of adding a set of unaligned sequences into a multiple sequence alignment (i.e., a constraint alignment). EMMA builds on MAFFT--add, which is also designed to add sequences into a given constraint alignment. EMMA improves on MAFFT--add methods by using a divide-and-conquer framework to scale its most accurate version, MAFFT-linsi--add, to constraint alignments with many sequences. We show that EMMA has an accuracy advantage over other techniques for adding sequences into alignments under many realistic conditions and can scale to large datasets with high accuracy (hundreds of thousands of sequences). EMMA is available at https://github.com/c5shen/EMMA. EMMA is a new tool that provides high accuracy and scalability for adding sequences into an existing alignment.
Citations: 1
Correction: Constructing founder sets under allelic and non-allelic homologous recombination.
IF 1 | Zone 4, Biology | Q4 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2023-12-06 | DOI: 10.1186/s13015-023-00244-0
Konstantinn Bonnet, Tobias Marschall, Daniel Doerr
Published in Algorithms for Molecular Biology 18(1):20. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10698948/pdf/
Citations: 0
Quartets enable statistically consistent estimation of cell lineage trees under an unbiased error and missingness model.
IF 1.5 | Zone 4, Biology | Q4 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2023-12-01 | DOI: 10.1186/s13015-023-00248-w
Yunheng Han, Erin K Molloy

Cancer progression and treatment can be informed by reconstructing its evolutionary history from tumor cells. Although many methods exist to estimate evolutionary trees (called phylogenies) from molecular sequences, traditional approaches assume the input data are error-free and the output tree is fully resolved. These assumptions are challenged in tumor phylogenetics because single-cell sequencing produces sparse, error-ridden data and because tumors evolve clonally. Here, we study the theoretical utility of methods based on quartets (four-leaf, unrooted phylogenetic trees) in light of these barriers. We consider a popular tumor phylogenetics model, in which mutations arise on a (highly unresolved) tree and then (unbiased) errors and missing values are introduced. Quartets are then implied by mutations present in two cells and absent from two cells. Our main result is that the most probable quartet identifies the unrooted model tree on four cells. This motivates seeking a tree such that the number of quartets shared between it and the input mutations is maximized. We prove an optimal solution to this problem is a consistent estimator of the unrooted cell lineage tree; this guarantee includes the case where the model tree is highly unresolved, with error defined as the number of false negative branches. Lastly, we outline how quartet-based methods might be employed when there are copy number aberrations and other challenges specific to tumor phylogenetics.
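
The statement that a quartet is implied by a mutation present in two cells and absent from two cells can be turned into a short counting routine. The sketch below is an assumption-laden illustration, not the authors' software: the matrix layout (rows are cells, columns are mutations, `None` marks a missing value) and the function name `implied_quartets` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import List, Optional


def implied_quartets(matrix: List[List[Optional[int]]]) -> Counter:
    """Count quartets ab|cd implied by each mutation column.

    A mutation implies ab|cd when cells a and b carry it (1) and cells c and d
    do not (0); cells with a missing value (None) are skipped for that column.
    """
    counts: Counter = Counter()
    n_cells = len(matrix)
    n_mutations = len(matrix[0]) if matrix else 0
    for m in range(n_mutations):
        present = [c for c in range(n_cells) if matrix[c][m] == 1]
        absent = [c for c in range(n_cells) if matrix[c][m] == 0]
        for pair_present in combinations(present, 2):
            for pair_absent in combinations(absent, 2):
                # frozenset makes ab|cd and cd|ab the same key.
                counts[frozenset((pair_present, pair_absent))] += 1
    return counts


if __name__ == "__main__":
    cells_by_mutations = [
        [1, 1, 0],
        [1, 1, 0],
        [0, 0, 1],
        [0, None, 1],
    ]
    for quartet, count in implied_quartets(cells_by_mutations).items():
        print(sorted(quartet), count)
```

A tree-search method in the spirit of the abstract would then look for a tree that agrees with as many of these counted quartets as possible; that optimization is not shown here.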

Published in Algorithms for Molecular Biology 18(1):19. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10691101/pdf/
Citations: 0
Automated design of dynamic programming schemes for RNA folding with pseudoknots. RNA伪结折叠动态规划方案的自动设计。
IF 1 Zone 4 (Biology) Q4 BIOCHEMICAL RESEARCH METHODS Pub Date : 2023-12-01 DOI: 10.1186/s13015-023-00229-z
Bertrand Marchand, Sebastian Will, Sarah J Berkemer, Yann Ponty, Laurent Bulteau

Although RNA secondary structure prediction is a textbook application of dynamic programming (DP) and a routine task in RNA structure analysis, it remains challenging whenever pseudoknots come into play. Since the prediction of pseudoknotted structures by minimizing (realistically modelled) energy is NP-hard, specialized algorithms have been proposed for restricted conformation classes that capture the most frequently observed configurations. To achieve good performance, these methods rely on specific and carefully hand-crafted DP schemes. In contrast, we generalize and fully automate the design of DP pseudoknot prediction algorithms. For this purpose, we formalize the problem of designing DP algorithms for an (infinite) class of conformations, modeled by (a finite number of) fatgraphs, and automatically build DP schemes minimizing their algorithmic complexity. We propose an algorithm for the problem, based on the tree-decomposition of a well-chosen representative structure, which we simplify and reinterpret as a DP scheme. The algorithm is fixed-parameter tractable for the treewidth tw of the fatgraph, and its output represents a [Formula: see text] algorithm (and even possibly [Formula: see text] in simple energy models) for predicting the minimum free energy (MFE) folding of an RNA of length n. We demonstrate, for the most common pseudoknot classes, that our automatically generated algorithms achieve the same complexities as reported in the literature for hand-crafted schemes. Our framework supports general energy models, partition function computations, recursive substructures and partial folding, and could pave the way for algebraic dynamic programming beyond the context-free case.
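
For readers unfamiliar with the "textbook application" the abstract refers to, the pseudoknot-free case can be sketched in a few lines. The following is a minimal Nussinov-style base-pair maximization, not the authors' automatically generated scheme: it counts nested pairs rather than minimizing a realistic energy, and the function name `max_pairs`, the pairing rules, and the minimum hairpin loop length of 3 are assumptions made for illustration.

```python
def max_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested (pseudoknot-free) base pairs in seq.

    Illustrative Nussinov-style recursion; it counts pairs instead of
    minimizing a realistic energy model.
    """
    # Watson-Crick and G-U wobble pairs (assumed pairing rules).
    can_pair = {("A", "U"), ("U", "A"), ("G", "C"),
                ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    # best[i][j] = maximum number of pairs within seq[i..j], inclusive.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: position j stays unpaired.
            score = best[i][j - 1]
            # Case 2: j pairs with some k, splitting the interval in two.
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in can_pair:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0
```

This sketch runs in cubic time in the sequence length and cannot represent crossing pairs, which is the limitation that pseudoknot-aware DP schemes such as those discussed in the abstract address.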

Citations: 1
New algorithms for structure informed genome rearrangement.
IF 1 Zone 4 (Biology) Q4 BIOCHEMICAL RESEARCH METHODS Pub Date : 2023-12-01 DOI: 10.1186/s13015-023-00239-x
Eden Ozeri, Meirav Zehavi, Michal Ziv-Ukelson

We define two new computational problems in the domain of perfect genome rearrangements, and propose three algorithms to solve them. The rearrangement scenarios modeled by the problems consider Reversal and Block Interchange operations, and a PQ-tree is utilized to guide the allowed operations and to compute their weights. In the first problem, [Formula: see text] ([Formula: see text]), we define the basic structure-informed rearrangement measure. Here, we assume that the gene order members of the gene cluster from which the PQ-tree is constructed are permutations. The PQ-tree representing the gene cluster is ordered such that the series of gene IDs spelled by its leaves is equivalent to that of the reference gene order. Then, a structure-informed genome rearrangement distance is computed between the ordered PQ-tree and the target gene order. The second problem, [Formula: see text] ([Formula: see text]), generalizes [Formula: see text], where the gene order members are not necessarily permutations and the structure informed rearrangement measure is extended to also consider up to [Formula: see text] and [Formula: see text] gene insertion and deletion operations, respectively, when modelling the PQ-tree informed divergence process from the reference gene order to the target gene order. The first algorithm solves [Formula: see text] in [Formula: see text] time and [Formula: see text] space, where [Formula: see text] is the maximum number of children of a node, n is the length of the string and the number of leaves in the tree, and [Formula: see text] and [Formula: see text] are the number of P-nodes and Q-nodes in the tree, respectively. If one of the penalties of [Formula: see text] is 0, then the algorithm runs in [Formula: see text] time and [Formula: see text] space. 
The second algorithm solves [Formula: see text] in [Formula: see text] time and [Formula: see text] space, where [Formula: see text] is the maximum number of children of a node, n is the length of the string, m is the number of leaves in the tree, [Formula: see text] and [Formula: see text] are the number of P-nodes and Q-nodes in the tree, respectively, and allowing up to [Formula: see text] deletions from the tree and up to [Formula: see text] deletions from the string. The third algorithm is intended to reduce the space complexity of the second algorithm. It solves a variant of the problem (where one of the penalties of [Formula: see text] is 0) in [Formula: see text] time and [Formula: see text] space. The algorithm is implemented as a software tool, denoted MEM-Rearrange, and applied to the comparative and evolutionary analysis of 59 chromosomal gene clusters extracted from a dataset of 1487 prokaryotic genomes.
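
The two rearrangement operations named in the abstract can be illustrated on a plain gene order. This is a minimal sketch assuming gene orders are Python lists of gene IDs with gene orientation ignored; the function names and index conventions are hypothetical, and neither the PQ-tree constraints nor the operation weights from the paper are modeled.

```python
from typing import List, Sequence


def reversal(order: Sequence[int], i: int, j: int) -> List[int]:
    """Reverse the segment order[i..j] (inclusive), leaving the rest in place."""
    if not 0 <= i <= j < len(order):
        raise ValueError("need 0 <= i <= j < len(order)")
    return list(order[:i]) + list(reversed(order[i:j + 1])) + list(order[j + 1:])


def block_interchange(order: Sequence[int], i: int, j: int, k: int, l: int) -> List[int]:
    """Swap the disjoint blocks order[i..j] and order[k..l] (inclusive, j < k)."""
    if not 0 <= i <= j < k <= l < len(order):
        raise ValueError("need 0 <= i <= j < k <= l < len(order)")
    return (list(order[:i]) + list(order[k:l + 1]) + list(order[j + 1:k])
            + list(order[i:j + 1]) + list(order[l + 1:]))


if __name__ == "__main__":
    genes = [1, 2, 3, 4, 5, 6]
    print(reversal(genes, 1, 3))                 # reverses genes 2..4
    print(block_interchange(genes, 0, 1, 4, 5))  # swaps blocks [1, 2] and [5, 6]
```

Both helpers return a new list, so a sequence of operations can be applied step by step while keeping every intermediate gene order available for inspection.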

Citations: 0
Relative timing information and orthology in evolutionary scenarios.
IF 1 Zone 4 (Biology) Q4 BIOCHEMICAL RESEARCH METHODS Pub Date : 2023-11-08 DOI: 10.1186/s13015-023-00240-4
David Schaller, Tom Hartmann, Manuel Lafond, Peter F Stadler, Nicolas Wieseke, Marc Hellmuth

Background: Evolutionary scenarios describing the evolution of a family of genes within a collection of species comprise the mapping of the vertices of a gene tree T to vertices and edges of a species tree S. The relative timing of the last common ancestors of two extant genes (leaves of T) and the last common ancestors of the two species (leaves of S) in which they reside is indicative of horizontal gene transfers (HGT) and ancient duplications. Orthologous gene pairs, on the other hand, require that their last common ancestor coincides with a corresponding speciation event. The relative timing information of gene and species divergences is captured by three colored graphs that have the extant genes as vertices and the species in which the genes are found as vertex colors: the equal-divergence-time (EDT) graph, the later-divergence-time (LDT) graph and the prior-divergence-time (PDT) graph, which together form an edge partition of the complete graph.

Results: Here we give a complete characterization in terms of informative and forbidden triples that can be read off the three graphs and provide a polynomial-time algorithm for constructing an evolutionary scenario that explains the graphs, provided such a scenario exists. While both LDT and PDT graphs are cographs, this is not true for the EDT graph in general. We show that every EDT graph is perfect. While the information about LDT and PDT graphs is necessary to recognize EDT graphs in polynomial time for general scenarios, this extra information can be dropped in the HGT-free case. However, recognition of EDT graphs without knowledge of putative LDT and PDT graphs is NP-complete for general scenarios. In contrast, PDT graphs can be recognized in polynomial time. We finally connect the EDT graph to the alternative definitions of orthology that have been proposed for scenarios with horizontal gene transfer. With one exception, the corresponding graphs are shown to be colored cographs.
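
The cograph property mentioned in the results can be made concrete with the standard characterization of cographs as graphs containing no induced path on four vertices (P4). The following is a brute-force sketch for small graphs, not the recognition algorithm from the paper; the name `is_cograph` and the adjacency-dictionary representation are assumptions, and vertex colors are not modeled.

```python
from itertools import combinations, permutations
from typing import Dict, Hashable, Set

# Undirected graph as a symmetric adjacency dictionary (assumed representation).
Graph = Dict[Hashable, Set[Hashable]]


def is_cograph(adj: Graph) -> bool:
    """Return True if the undirected graph has no induced path on 4 vertices."""
    def edge(u: Hashable, v: Hashable) -> bool:
        return v in adj[u]

    for quad in combinations(adj, 4):
        for a, b, c, d in permutations(quad):
            # Induced P4 a-b-c-d: exactly the three path edges are present.
            if (edge(a, b) and edge(b, c) and edge(c, d)
                    and not edge(a, c) and not edge(a, d) and not edge(b, d)):
                return False
    return True
```

The check examines every set of four vertices, so it is only suitable for small examples; linear-time cograph recognition algorithms exist but are considerably more involved.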

Citations: 1