Algorithms for Molecular Biology: Latest Articles (Page 6)

Correction: Constructing founder sets under allelic and non-allelic homologous recombination.

IF 1, Biology Zone 4, Q4 BIOCHEMICAL RESEARCH METHODS

Algorithms for Molecular Biology

Pub Date : 2023-12-06 DOI: 10.1186/s13015-023-00244-0

Konstantinn Bonnet, Tobias Marschall, Daniel Doerr

Citations: 0

Quartets enable statistically consistent estimation of cell lineage trees under an unbiased error and missingness model.

IF 1.5, Biology Zone 4, Q4 BIOCHEMICAL RESEARCH METHODS

Algorithms for Molecular Biology

Pub Date : 2023-12-01 DOI: 10.1186/s13015-023-00248-w

Yunheng Han, Erin K Molloy

Cancer progression and treatment can be informed by reconstructing its evolutionary history from tumor cells. Although many methods exist to estimate evolutionary trees (called phylogenies) from molecular sequences, traditional approaches assume the input data are error-free and the output tree is fully resolved. These assumptions are challenged in tumor phylogenetics because single-cell sequencing produces sparse, error-ridden data and because tumors evolve clonally. Here, we study the theoretical utility of methods based on quartets (four-leaf, unrooted phylogenetic trees) in light of these barriers. We consider a popular tumor phylogenetics model, in which mutations arise on a (highly unresolved) tree and then (unbiased) errors and missing values are introduced. Quartets are then implied by mutations present in two cells and absent from two cells. Our main result is that the most probable quartet identifies the unrooted model tree on four cells. This motivates seeking a tree such that the number of quartets shared between it and the input mutations is maximized. We prove an optimal solution to this problem is a consistent estimator of the unrooted cell lineage tree; this guarantee includes the case where the model tree is highly unresolved, with error defined as the number of false negative branches. Lastly, we outline how quartet-based methods might be employed when there are copy number aberrations and other challenges specific to tumor phylogenetics.
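
The statement that quartets are "implied by mutations present in two cells and absent from two cells" can be illustrated with a small sketch. The matrix encoding (1 = present, 0 = absent, None = missing) and the function names `quartet_votes` and `best_quartet` are assumptions made for this example; they are not taken from the paper.

```python
from collections import Counter

def quartet_votes(matrix, cells):
    """Count the quartet splits implied by each mutation for four cells.

    matrix: list of rows, one per mutation; row[i] is 1 (present),
    0 (absent) or None (missing) for cell i. This encoding is hypothetical.
    """
    votes = Counter()
    for row in matrix:
        states = [row[c] for c in cells]
        # A mutation is informative only if it is observed in all four
        # cells and present in exactly two of them.
        if None in states or sum(states) != 2:
            continue
        present = frozenset(c for c, s in zip(cells, states) if s == 1)
        absent = frozenset(cells) - present
        votes[frozenset((present, absent))] += 1
    return votes

def best_quartet(matrix, cells):
    """Return the most frequently implied split as two sorted lists, or None."""
    votes = quartet_votes(matrix, cells)
    if not votes:
        return None
    split, _ = votes.most_common(1)[0]
    return sorted(sorted(side) for side in split)

mutations = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],     # conflicting signal, e.g. caused by an error
    [1, None, 0, 0],  # missing value: uninformative
    [0, 0, 1, 1],
]
print(best_quartet(mutations, (0, 1, 2, 3)))
```

Taking the most frequent split mirrors the abstract's idea that the most probable quartet identifies the model tree on four cells; the actual estimator maximizes agreement over all quartets at once, which this sketch does not attempt.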

Vol. 18, No. 1, p. 19. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10691101/pdf/

Citations: 0

Automated design of dynamic programming schemes for RNA folding with pseudoknots.

IF 1, Biology Zone 4, Q4 BIOCHEMICAL RESEARCH METHODS

Algorithms for Molecular Biology

Pub Date : 2023-12-01 DOI: 10.1186/s13015-023-00229-z

Bertrand Marchand, Sebastian Will, Sarah J Berkemer, Yann Ponty, Laurent Bulteau

Although RNA secondary structure prediction is a textbook application of dynamic programming (DP) and a routine task in RNA structure analysis, it remains challenging whenever pseudoknots come into play. Since the prediction of pseudoknotted structures by minimizing (realistically modelled) energy is NP-hard, specialized algorithms have been proposed for restricted conformation classes that capture the most frequently observed configurations. To achieve good performance, these methods rely on specific and carefully hand-crafted DP schemes. In contrast, we generalize and fully automate the design of DP pseudoknot prediction algorithms. For this purpose, we formalize the problem of designing DP algorithms for an (infinite) class of conformations, modeled by (a finite number of) fatgraphs, and automatically build DP schemes minimizing their algorithmic complexity. We propose an algorithm for the problem, based on the tree-decomposition of a well-chosen representative structure, which we simplify and reinterpret as a DP scheme. The algorithm is fixed-parameter tractable for the treewidth tw of the fatgraph, and its output represents a [Formula: see text] algorithm (and even possibly [Formula: see text] in simple energy models) for predicting the MFE folding of an RNA of length n. We demonstrate, for the most common pseudoknot classes, that our automatically generated algorithms achieve the same complexities as reported in the literature for hand-crafted schemes. Our framework supports general energy models, partition function computations, recursive substructures and partial folding, and could pave the way for algebraic dynamic programming beyond the context-free case.
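
For context, the "textbook application of dynamic programming" mentioned above is the pseudoknot-free case. A minimal sketch of that baseline is the Nussinov-style base-pair maximization below. It is not the paper's method and handles no pseudoknots; the function name `nussinov`, the pairing rules, and the `min_loop` parameter are assumptions chosen for illustration.

```python
def nussinov(seq, min_loop=3):
    """Maximum number of nested (pseudoknot-free) base pairs in seq.

    min_loop is the minimum number of unpaired bases enclosed by a pair;
    Watson-Crick and G-U wobble pairs are allowed (an illustrative choice).
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    # best[i][j] holds the optimum for the subsequence seq[i..j];
    # entries with i > j stay 0 and act as the empty interval.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j is unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with k
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0

print(nussinov("GGGAAACCC"))
```

This scheme runs in cubic time and quadratic space for a sequence of length n. Pseudoknotted classes need tables indexed by more positions, which is why the design of such schemes by hand is laborious.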

Vol. 18, No. 1, p. 18. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10691146/pdf/

Citations: 1

New algorithms for structure informed genome rearrangement.

IF 1, Biology Zone 4, Q4 BIOCHEMICAL RESEARCH METHODS

Algorithms for Molecular Biology

Pub Date : 2023-12-01 DOI: 10.1186/s13015-023-00239-x

Eden Ozeri, Meirav Zehavi, Michal Ziv-Ukelson

We define two new computational problems in the domain of perfect genome rearrangements, and propose three algorithms to solve them. The rearrangement scenarios modeled by the problems consider Reversal and Block Interchange operations, and a PQ-tree is utilized to guide the allowed operations and to compute their weights. In the first problem, [Formula: see text] ([Formula: see text]), we define the basic structure-informed rearrangement measure. Here, we assume that the gene order members of the gene cluster from which the PQ-tree is constructed are permutations. The PQ-tree representing the gene cluster is ordered such that the series of gene IDs spelled by its leaves is equivalent to that of the reference gene order. Then, a structure-informed genome rearrangement distance is computed between the ordered PQ-tree and the target gene order. The second problem, [Formula: see text] ([Formula: see text]), generalizes [Formula: see text], where the gene order members are not necessarily permutations and the structure informed rearrangement measure is extended to also consider up to [Formula: see text] and [Formula: see text] gene insertion and deletion operations, respectively, when modelling the PQ-tree informed divergence process from the reference gene order to the target gene order. The first algorithm solves [Formula: see text] in [Formula: see text] time and [Formula: see text] space, where [Formula: see text] is the maximum number of children of a node, n is the length of the string and the number of leaves in the tree, and [Formula: see text] and [Formula: see text] are the number of P-nodes and Q-nodes in the tree, respectively. If one of the penalties of [Formula: see text] is 0, then the algorithm runs in [Formula: see text] time and [Formula: see text] space. 
The second algorithm solves [Formula: see text] in [Formula: see text] time and [Formula: see text] space, where [Formula: see text] is the maximum number of children of a node, n is the length of the string, m is the number of leaves in the tree, [Formula: see text] and [Formula: see text] are the number of P-nodes and Q-nodes in the tree, respectively, and allowing up to [Formula: see text] deletions from the tree and up to [Formula: see text] deletions from the string. The third algorithm is intended to reduce the space complexity of the second algorithm. It solves a variant of the problem (where one of the penalties of [Formula: see text] is 0) in [Formula: see text] time and [Formula: see text] space. The algorithm is implemented as a software tool, denoted MEM-Rearrange, and applied to the comparative and evolutionary analysis of 59 chromosomal gene clusters extracted from a dataset of 1487 prokaryotic genomes.
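
As background for how a PQ-tree "guides the allowed operations", the standard PQ-tree semantics can be sketched as follows: the children of a P-node may be permuted arbitrarily, while the children of a Q-node may only be kept in order or reversed. The tree encoding and the function name `frontiers` are assumptions made for this example, and the brute-force enumeration is exponential, so it is meant for tiny trees only. It does not implement the paper's distance computation or its weights.

```python
from itertools import permutations, product

def frontiers(node):
    """Enumerate every leaf order (frontier) that a PQ-tree admits.

    node is either a leaf label (str) or a pair (kind, children)
    with kind "P" or "Q". This encoding is hypothetical.
    """
    if isinstance(node, str):
        return {(node,)}
    kind, children = node
    if kind == "P":
        orders = permutations(children)  # any order of the children
    else:
        orders = (tuple(children), tuple(reversed(children)))  # as-is or reversed
    result = set()
    for order in orders:
        # Combine one frontier per child, in the chosen child order.
        for combo in product(*(frontiers(child) for child in order)):
            result.add(tuple(leaf for part in combo for leaf in part))
    return result

tree = ("P", ["a", ("Q", ["b", "c", "d"])])
for order in sorted(frontiers(tree)):
    print(" ".join(order))
```

Here the gene order a c b d is not admitted, because it would break the Q-node's fixed adjacency of b, c and d.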

Vol. 18, No. 1, p. 17. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10691145/pdf/

Citations: 0

Relative timing information and orthology in evolutionary scenarios.

IF 1, Biology Zone 4, Q4 BIOCHEMICAL RESEARCH METHODS

Algorithms for Molecular Biology

Pub Date : 2023-11-08 DOI: 10.1186/s13015-023-00240-4

David Schaller, Tom Hartmann, Manuel Lafond, Peter F Stadler, Nicolas Wieseke, Marc Hellmuth

Background: Evolutionary scenarios describing the evolution of a family of genes within a collection of species comprise the mapping of the vertices of a gene tree T to vertices and edges of a species tree S. The relative timing of the last common ancestors of two extant genes (leaves of T) and the last common ancestors of the two species (leaves of S) in which they reside is indicative of horizontal gene transfers (HGT) and ancient duplications. Orthologous gene pairs, on the other hand, require that their last common ancestor coincides with a corresponding speciation event. The relative timing information of gene and species divergences is captured by three colored graphs that have the extant genes as vertices and the species in which the genes are found as vertex colors: the equal-divergence-time (EDT) graph, the later-divergence-time (LDT) graph and the prior-divergence-time (PDT) graph, which together form an edge partition of the complete graph.

Results: Here we give a complete characterization in terms of informative and forbidden triples that can be read off the three graphs and provide a polynomial time algorithm for constructing an evolutionary scenario that explains the graphs, provided such a scenario exists. While both LDT and PDT graphs are cographs, this is not true for the EDT graph in general. We show that every EDT graph is perfect. While the information about LDT and PDT graphs is necessary to recognize EDT graphs in polynomial-time for general scenarios, this extra information can be dropped in the HGT-free case. However, recognition of EDT graphs without knowledge of putative LDT and PDT graphs is NP-complete for general scenarios. In contrast, PDT graphs can be recognized in polynomial-time. We finally connect the EDT graph to the alternative definitions of orthology that have been proposed for scenarios with horizontal gene transfer. With one exception, the corresponding graphs are shown to be colored cographs.
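
The abstract relies on the notion of a cograph. One standard characterization is recursive: a graph with at most one vertex is a cograph, and a larger graph is a cograph exactly when it or its complement is disconnected and every resulting part is again a cograph. The sketch below tests this directly; the adjacency-set encoding and the names `components` and `is_cograph` are assumptions for illustration, and vertex colors are ignored, so this is not the paper's recognition algorithm for LDT, PDT or EDT graphs.

```python
def components(vertices, adj):
    """Connected components of the subgraph induced by `vertices`."""
    seen, comps = set(), []
    for start in vertices:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for w in adj[u] & vertices:
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        comps.append(comp)
    return comps

def is_cograph(vertices, adj):
    """True if the subgraph induced by `vertices` contains no induced P4."""
    vertices = set(vertices)
    if len(vertices) <= 1:
        return True
    parts = components(vertices, adj)
    if len(parts) == 1:
        # Connected: try splitting along the complement graph instead.
        co_adj = {v: vertices - adj[v] - {v} for v in vertices}
        parts = components(vertices, co_adj)
        if len(parts) == 1:
            return False  # graph and complement both connected
    return all(is_cograph(part, adj) for part in parts)

path4 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
cycle4 = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
print(is_cograph(path4, path4), is_cograph(cycle4, cycle4))
```

The path on four vertices is the forbidden induced subgraph, so it fails the test, while the four-cycle passes because its complement splits into two disjoint edges.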

Vol. 18, No. 1, p. 16. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10634191/pdf/

Citations: 1

On a greedy approach for genome scaffolding.

IF 1, Biology Zone 4, Q4 BIOCHEMICAL RESEARCH METHODS

Algorithms for Molecular Biology

Pub Date : 2022-10-29 DOI: 10.1186/s13015-022-00223-x

Tom Davot, Annie Chateau, Rohan Fossé, Rodolphe Giroudeau, Mathias Weller

Background: Scaffolding is a bioinformatics problem aimed at completing the contig assembly process by determining the relative position and orientation of the contigs. It can be seen as a path-and-cycle cover problem on a particular graph called the "scaffold graph".

Results: We provide some NP-hardness and inapproximability results on this problem. We also adapt a greedy approximation algorithm on complete graphs so that it works on a special class aiming to be close to real instances. The described algorithm is the first polynomial-time approximation algorithm designed for this problem on non-complete graphs.

Conclusion: Tests on a set of simulated instances show that our algorithm provides better results than the version on complete graphs.
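
To make the greedy idea concrete, here is a deliberately simplified sketch, not the algorithm analysed in the paper. It assumes each contig has two extremities ("head" and "tail") and that weighted candidate links join extremities of different contigs; links are accepted in decreasing weight order as long as both extremities are still free, so the contigs end up covered by paths and cycles. The data layout and the name `greedy_scaffold` are hypothetical, and any bounds on the number of paths and cycles are ignored.

```python
def greedy_scaffold(links):
    """Greedily select inter-contig links by decreasing weight.

    links: iterable of (weight, end_u, end_v), where each end is a
    (contig_name, "head" or "tail") pair. Hypothetical input format.
    """
    used, chosen = set(), []
    for weight, u, v in sorted(links, key=lambda link: link[0], reverse=True):
        if u[0] == v[0]:
            continue  # never link a contig to itself in this sketch
        if u in used or v in used:
            continue  # each extremity can be joined at most once
        used.update((u, v))
        chosen.append((weight, u, v))
    return chosen

links = [
    (10, ("A", "tail"), ("B", "head")),
    (9, ("A", "tail"), ("C", "head")),
    (7, ("B", "tail"), ("C", "head")),
    (3, ("C", "tail"), ("A", "head")),
]
for weight, u, v in greedy_scaffold(links):
    print(weight, u, v)
```

In this toy input the link of weight 9 is rejected because the tail of A is already taken, and the three accepted links close a cycle through A, B and C.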

Citations: 0

Treewidth-based algorithms for the small parsimony problem on networks.

IF 1, Biology Zone 4, Q4 BIOCHEMICAL RESEARCH METHODS

Algorithms for Molecular Biology

Pub Date : 2022-08-20 DOI: 10.1186/s13015-022-00216-w

Celine Scornavacca, Mathias Weller

Background: Phylogenetic reconstruction is one of the paramount challenges of contemporary bioinformatics. A subtask of existing tree reconstruction algorithms is modeled by the SMALL PARSIMONY problem: given a tree T and an assignment of character-states to its leaves, assign states to the internal nodes of T so as to minimize the parsimony score, that is, the number of edges of T connecting nodes with different states. While this problem is polynomial-time solvable on trees, the matter is more complicated if T contains reticulate events such as hybridizations or recombinations, i.e. when T is a network. Indeed, three different versions of the parsimony score on networks have been proposed and each of them is NP-hard to decide. Existing parameterized algorithms focus on combining the number c of possible character-states with the number of reticulate events (per biconnected component).

Results: We consider the parameter treewidth t of the underlying undirected graph of the input network, presenting dynamic programming algorithms for (slight generalizations of) all three versions of the parsimony problem on size-n networks running in times [Formula: see text], [Formula: see text], and [Formula: see text], respectively. Our algorithms use a formulation of the treewidth that may facilitate formalizing treewidth-based dynamic programming algorithms on phylogenetic networks for other problems.

Conclusions: Our algorithms allow the computation of the three popular parsimony scores, modeling the evolutionary development of a (multistate) character on a given phylogenetic network of low treewidth. Our results subsume and improve previously known algorithms for all three variants. While our results rely on being given a "good" tree-decomposition of the input, encouraging theoretical results as well as practical implementations producing them are publicly available. We present a reformulation of tree decompositions in terms of "agreeing trees" on the same set of nodes. As this formulation may come more naturally to researchers and engineers developing algorithms for phylogenetic networks, we hope to render exploiting the input network's treewidth as a parameter more accessible to this audience.
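
The polynomial-time tree case referred to in the Background can be illustrated with Fitch's classic algorithm for rooted binary trees. This is only the tree baseline, not any of the network algorithms of the paper; the dictionary encoding of the tree and the function name `fitch` are assumptions for this sketch.

```python
def fitch(tree, leaf_states, root):
    """Minimum parsimony score of one character on a rooted binary tree.

    tree: dict mapping each internal node to its (left, right) children.
    leaf_states: dict mapping each leaf to its character-state.
    """
    score = 0
    candidate = {}  # node -> set of states that are optimal at that node

    def visit(node):
        nonlocal score
        if node in leaf_states:
            candidate[node] = {leaf_states[node]}
            return
        left, right = tree[node]
        visit(left)
        visit(right)
        common = candidate[left] & candidate[right]
        if common:
            candidate[node] = common
        else:
            # No shared state: one state change is unavoidable here.
            candidate[node] = candidate[left] | candidate[right]
            score += 1

    visit(root)
    return score

tree = {"r": ("x", "y"), "x": ("a", "b"), "y": ("c", "d")}
print(fitch(tree, {"a": "A", "b": "C", "c": "A", "d": "A"}, "r"))
```

Each node is visited once, so the bottom-up pass is linear in the tree size for a fixed number of states. On networks a node may have several parents, which is where this simple recursion stops working.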

p. 15. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9392953/pdf/

Citations: 3

Binning long reads in metagenomics datasets using composition and coverage information.

IF 1, Biology Zone 4, Q4 BIOCHEMICAL RESEARCH METHODS

Algorithms for Molecular Biology

Pub Date : 2022-07-11 DOI: 10.1186/s13015-022-00221-z

Anuradha Wickramarachchi, Yu Lin

Background: Advancements in metagenomics sequencing allow the study of microbial communities directly from their environments. Metagenomics binning is a key step in the species characterisation of microbial communities. Next-generation sequencing reads are usually assembled into contigs for metagenomics binning mainly due to the limited information within short reads. Third-generation sequencing provides much longer reads that have lengths similar to the contigs assembled from short reads. However, existing contig-binning tools cannot be directly applied on long reads due to the absence of coverage information and the presence of high error rates. The few existing long-read binning tools either use only composition or use composition and coverage information separately. This may ignore bins that correspond to low-abundance species or erroneously split bins that correspond to species with non-uniform coverages. Here we present a reference-free binning approach, LRBinner, that combines composition and coverage information of complete long-read datasets. LRBinner also uses a distance-histogram-based clustering algorithm to extract clusters with varying sizes.

Results: Experiments on both simulated and real datasets show that LRBinner achieves the best binning accuracy in most cases while handling complete datasets without any sampling. Moreover, we show that binning reads with LRBinner prior to assembly reduces the computational resources required for assembly while attaining satisfactory assembly quality.

Conclusion: LRBinner shows that deep-learning techniques can be used for effective feature aggregation to support the metagenomics binning of long reads. Furthermore, accurate binning of long reads supports improvements in metagenomics assembly, especially in complex datasets. Binning also helps to reduce the resources required for assembly. Source code for LRBinner is freely available at https://github.com/anuradhawick/LRBinner.
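
The two signals named in the abstract, composition and coverage, can be illustrated with a minimal Python sketch. This is not LRBinner's implementation: the function names, the k-mer sizes (3 and 15), the histogram binning, and the simple concatenation of the two vectors are all assumptions chosen for illustration, and reverse complements are not merged. Composition is approximated by normalised k-mer frequencies of a read; coverage is approximated by a histogram of how often the read's k-mers occur across the whole dataset.

```python
from collections import Counter
from itertools import product


def kmers(seq, k):
    """Yield every k-mer of seq that consists only of A, C, G, T."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer


def composition_vector(read, k=3):
    """Normalised k-mer frequencies of one read (a composition profile)."""
    counts = Counter(kmers(read, k))
    total = sum(counts.values())
    alphabet = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[m] / total if total else 0.0 for m in alphabet]


def coverage_histogram(read, dataset_counts, k=15, bins=8, bin_width=4):
    """Histogram of dataset-wide counts of the read's k-mers (a coverage proxy).

    dataset_counts maps each k-mer to its count over all reads; counts at or
    above bins * bin_width fall into the last bin.
    """
    hist = [0] * bins
    n = 0
    for kmer in kmers(read, k):
        count = dataset_counts.get(kmer, 0)
        hist[min(count // bin_width, bins - 1)] += 1
        n += 1
    return [h / n if n else 0.0 for h in hist]


def read_features(read, dataset_counts, comp_k=3, cov_k=15):
    """Concatenate both profiles into one feature vector (illustrative choice)."""
    return (composition_vector(read, comp_k)
            + coverage_histogram(read, dataset_counts, cov_k))


# Toy usage: two hypothetical reads stand in for a long-read dataset.
reads = ["ACGTACGTACGTACGTACGT", "ACGTACGTACGTACGTTTTT"]
dataset_counts = Counter(m for r in reads for m in kmers(r, 15))
features = read_features(reads[0], dataset_counts)
```

Each read ends up with a fixed-length vector (64 composition entries plus 8 coverage entries under these assumed settings) that a clustering step could then group into bins.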

Citations: 8

Embedding gene trees into phylogenetic networks by conflict resolution algorithms

IF 1 Zone 4 (Biology) Q4 BIOCHEMICAL RESEARCH METHODS

Algorithms for Molecular Biology

Pub Date : 2022-05-19 DOI: 10.1186/s13015-022-00218-8

Marcin Wawerka, D. Dabkowski, Natalia Rutecka, Agnieszka Mykowiecka, P. Górecki