The C-ORIENTATION problem asks whether it is possible to orient an undirected graph to a directed phylogenetic network of a desired network class C. This problem arises, for example, when visualising evolutionary data, as popular methods such as Neighbor-Net are distance-based and inevitably produce undirected graphs. The complexity of C-ORIENTATION remains open for many classes C, including binary tree-child networks, and practical methods are still lacking. In this paper, we propose (1) an exact FPT algorithm for C-ORIENTATION, applicable to any class C admitting a tractable membership test, and parameterised by the reticulation number and the maximum size of minimal basic cycles, and (2) a very fast heuristic for TREE-CHILD ORIENTATION. While the state-of-the-art for C-ORIENTATION is a simple exponential time algorithm whose computational bottleneck lies in searching for appropriate reticulation vertex placements, our methods significantly reduce this search space. Experiments show that, although our FPT algorithm is still exponential, it significantly outperforms the existing method. The heuristic runs even faster but with increasing false negatives as the reticulation number grows. Given this trade-off, we also discuss theoretical directions for improvement and biological applicability of the heuristic approach.
Citation: Tsuyoshi Urata, Manato Yokoyama, Haruki Miyaji, Momoko Hayamizu. "Orientability of undirected phylogenetic networks to a desired class: practical algorithms and application to tree-child orientation." Algorithms for Molecular Biology 21(1):2. Published 2026-02-05. DOI: 10.1186/s13015-025-00282-w. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12874789/pdf/
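The abstract above relies on a tractable membership test for the target class. As an illustration only, the following is a minimal sketch of such a test for the tree-child class, using the standard definition that every non-leaf vertex must have at least one child that is not a reticulation (in-degree at most 1). The function name `is_tree_child` and the edge-list input format are assumptions made here, not the authors' implementation, and the sketch assumes the input is already a valid rooted directed acyclic network.

```python
from collections import defaultdict


def is_tree_child(edges):
    """Check the tree-child condition on a directed network.

    `edges` is an iterable of (parent, child) pairs. The input is assumed
    (not verified) to be a rooted, acyclic phylogenetic network.
    """
    children = defaultdict(list)
    indegree = defaultdict(int)
    vertices = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        vertices.update((parent, child))

    for v in vertices:
        kids = children.get(v, [])
        if not kids:
            continue  # leaves impose no condition
        # A reticulation has in-degree >= 2; v needs at least one other child.
        if all(indegree[c] >= 2 for c in kids):
            return False
    return True


# Hypothetical example: one reticulation h whose parents each keep a tree child.
good = [("r", "a"), ("r", "b"), ("a", "h"), ("b", "h"),
        ("a", "x"), ("b", "y"), ("h", "z")]
# Hypothetical example: a and b have only reticulation children.
bad = [("r", "a"), ("r", "b"), ("a", "h1"), ("a", "h2"),
       ("b", "h1"), ("b", "h2"), ("h1", "x"), ("h2", "y")]
print(is_tree_child(good), is_tree_child(bad))
```

The check runs in time linear in the number of edges, which is the kind of cheap test an orientation search could call on each candidate orientation.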
Pub Date: 2026-01-29 DOI: 10.1186/s13015-025-00294-6
Nora Beier, Thomas Gatter, Jakob L Andersen, Peter F Stadler
Background: Atom-to-atom maps play an important role in many applications. However, they are often difficult to obtain. The KEGG reaction database does not provide atom-to-atom maps for its reactions and instead offers a description of local changes for pairs of reactant and product molecules in terms of so-called RCLASSes. Developed for classification purposes, RCLASS data are difficult to use for purposes such as the construction of atom-to-atom maps or reaction rules. DPO graph transformation rules, on the other hand, work as a convenient and efficient representation, particularly for these applications. The RCLASS data can be understood as collections of local graph patterns in the reactants and products of a reaction, together with partial correspondences of atoms. The problem of converting RCLASS data into DPO rules, therefore, is a special case of the graph reconstruction problem, which consists of inferring a graph from a collection of subgraphs.
Results: We developed laveau, a tool that computes explicit DPO rules from KEGG reactions and RCLASS data. The algorithm proceeds stepwise, starting with a translation of individual RDM codes, specifically developed by the KEGG database, into equivalent RDM pattern graphs. Multiple RDM pattern graphs for the same RCLASS are then combined based on their embeddings into the reactant and product molecules, observing certain consistency conditions. In the final step, these combined pairwise patterns are merged into a pair of subgraphs of reactants and products, respectively. If RCLASSes connecting all pairs of reactant and product molecules are available, the complete reaction center(s) is/are contained in the union of these subgraphs. The atom-to-atom map inherited from the RDM codes then defines a DPO transformation rule. Application of these rules to the reactants then yields complete atom-to-atom maps (AAMs). Starting from 3195 RCLASSes, laveau generates a total of 1232 DPO rules and 1594 AAMs.
Conclusions: The laveau software makes it possible to extract local atom-to-atom maps from the RCLASSes of the KEGG database, covering a large set of enzyme-catalyzed reactions. The results are made available in the form of DPO rules for use in atom-level models of metabolic networks, filling a crucial gap in the available data.
Citation: Nora Beier, Thomas Gatter, Jakob L Andersen, Peter F Stadler. "Computing double-pushout graph transformation rules and atom-to-atom maps from KEGG RCLASS data." Algorithms for Molecular Biology. Published 2026-01-29. DOI: 10.1186/s13015-025-00294-6.
Pub Date: 2026-01-02 DOI: 10.1186/s13015-025-00293-7
Siying Yang, Neng Huang, Heng Li
Motivation: Spliced alignment refers to the alignment of messenger RNA (mRNA) or protein sequences to eukaryotic genomes. It plays a critical role in gene annotation and the study of gene functions. Accurate spliced alignment demands sophisticated modeling of splice sites, but current aligners use simple models, which may affect their accuracy given dissimilar sequences.
Results: We implemented minisplice to learn splice signals with a one-dimensional convolutional neural network (1D-CNN) and trained a model with 7026 parameters for vertebrate and insect genomes. It captures conserved splice signals across phyla and reveals GC-rich introns specific to mammals and birds. We used this model to estimate the empirical splicing probability for every GT and AG in genomes, and modified minimap2 and miniprot to leverage pre-computed splicing probability during alignment. Evaluation on human long-read RNA-seq data and cross-species protein datasets showed our method greatly improves the junction accuracy especially for noisy long RNA-seq reads and proteins of distant homology.
Availability and implementation: https://github.com/lh3/minisplice.
Citation: Siying Yang, Neng Huang, Heng Li. "Improving spliced alignment by modeling splice sites with deep learning." Algorithms for Molecular Biology, p. 1. Published 2026-01-02. DOI: 10.1186/s13015-025-00293-7. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12766944/pdf/
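The abstract above mentions scoring every GT and AG in a genome. The sketch below shows only the enumeration step: listing candidate donor (GT) and acceptor (AG) positions on the forward strand of a sequence. The function name `candidate_splice_sites` and the 0-based coordinate convention are assumptions for illustration; the 1D-CNN scoring model itself is not reproduced here.

```python
def candidate_splice_sites(seq):
    """Return 0-based start positions of GT (donor) and AG (acceptor) dinucleotides.

    Only the forward strand is scanned; a real pipeline would also handle
    the reverse complement and attach a probability to each position.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


donors, acceptors = candidate_splice_sites("ccGTaagtttAGgg")
print(donors, acceptors)
```

Each listed position would then be passed, with its flanking sequence, to whatever scoring model is in use.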
Pub Date: 2025-12-11 DOI: 10.1186/s13015-025-00291-9
Simon Gene Gottlieb, Knut Reinert
Adding rank support to strings over a fixed-sized alphabet has numerous applications. Prominent among those is the (bidirectional) FM-Index, which is commonly utilized to index and analyze genomic data. At its core lies the rank operation on the Burrows-Wheeler-Transform (BWT) which, given a position in the BWT and a character, answers how often the specified character appears from the start to that position. Implementing those rank queries is usually based on bit vectors with rank support. In this work, we discuss three implementation improvements. First, a novel approach named paired-blocks that reduces the space overhead of the support structure by half to a total of only 1.6%. Second, a method for masking bits for the population count (also known as popcount) which greatly improves the runtime of 512-bit wide blocks in conjunction with AVX512 SIMD extensions. Third, a revised method for EPR-dictionaries (Pockrandt et al. in International conference on research in computational molecular biology. Springer, New York, 2017) called flattened bit vectors (fBV) with less space consumption and faster rank operations on strings, which is competitive in size and, depending on the parameters, between 2× and 9× faster than Wavelet Trees (Gog et al. in 13th International Symposium on Experimental Algorithms. Springer, New York, 2014).
Citation: Simon Gene Gottlieb, Knut Reinert. "Engineering rank queries on bit vectors and strings." Algorithms for Molecular Biology, p. 21. Published 2025-12-11. DOI: 10.1186/s13015-025-00291-9. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12703928/pdf/
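To make the rank operation described above concrete, here is a minimal sketch of a textbook one-level scheme: the bit vector is cut into 64-bit blocks, a prefix count is stored per block, and a query adds a masked popcount inside the block. This is not the paired-blocks or fBV design from the paper; the class name `RankBitVector`, the block width, and the half-open convention (counting positions before index `i`) are assumptions chosen for illustration.

```python
class RankBitVector:
    """Bit vector with rank support via per-block prefix counts (illustrative)."""

    BLOCK = 64  # assumed block width in bits

    def __init__(self, bits):
        self.n = len(bits)
        self.words = []
        self.prefix = [0]  # prefix[b] = number of 1 bits before block b
        for start in range(0, self.n, self.BLOCK):
            word = 0
            for offset, bit in enumerate(bits[start:start + self.BLOCK]):
                if bit:
                    word |= 1 << offset
            self.words.append(word)
            self.prefix.append(self.prefix[-1] + bin(word).count("1"))

    def rank1(self, i):
        """Number of 1 bits in positions [0, i)."""
        if not 0 <= i <= self.n:
            raise IndexError(i)
        block, offset = divmod(i, self.BLOCK)
        if offset == 0:
            return self.prefix[block]
        mask = (1 << offset) - 1  # keep only the bits below `offset`
        return self.prefix[block] + bin(self.words[block] & mask).count("1")


bits = [1, 0, 1, 1, 0] * 40  # 200 bits, spans several blocks
bv = RankBitVector(bits)
print(bv.rank1(0), bv.rank1(5), bv.rank1(200))
```

The masking step is where hardware popcount instructions would be used in a compiled implementation; Python's `bin(...).count("1")` stands in for it here.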
Pub Date: 2025-12-10 DOI: 10.1186/s13015-025-00292-8
Kimon Boehmer, Sarah J Berkemer, Sebastian Will, Yann Ponty
RNAs composed of Triplet Repeats (TR) have recently attracted much attention in the field of synthetic biology. We study the minimum free energy (MFE) secondary structures of such RNAs and give improved algorithms to compute the MFE and the partition function. Furthermore, we study the interaction of multiple RNAs and design a new algorithm for computing MFE and partition function for RNA-RNA interactions, improving the previously known factorial running time to exponential. In the case of TR, we show computational hardness but still obtain a parameterized algorithm. Finally, we propose a polynomial-time algorithm for computing interactions from a base set of RNA strands and conduct experiments on the interaction of TR based on this algorithm. For instance, we study the probability that a base pair is formed between two strands with the same triplet pattern, allowing an assessment of a notion of orthogonality between TR.
Citation: Kimon Boehmer, Sarah J Berkemer, Sebastian Will, Yann Ponty. "RNA triplet repeats: improved algorithms for structure prediction and interactions." Algorithms for Molecular Biology. Published 2025-12-10. DOI: 10.1186/s13015-025-00292-8.
Pub Date: 2025-10-24 DOI: 10.1186/s13015-025-00278-6
Théo Boury, Samuel Gardelle, Laurent Bulteau, Yann Ponty
Inverse folding is a classic instance of negative RNA design which consists in finding a sequence that uniquely folds into a target secondary structure with respect to energy minimization. A breakthrough result of Bonnet et al. shows that, even in simple base pairs-based (BP) models, the decision version of a mildly constrained version of inverse folding is NP-hard. In this work, we show that inverse folding can be solved in linear time for a large collection of targets, including every structure that contains no isolated BP and no isolated stack (or, equivalently, when all helices consist of 3 or more base pairs). For structures featuring shorter helices, our linear algorithm is no longer guaranteed to produce a solution, but still does so for a large proportion of instances. Our approach introduces a notion of modulo m-separability, generalizing a property pioneered by Hales et al. Separability is a sufficient condition for the existence of a solution to the inverse folding problem. We show that, for any input secondary structure of length n, a modulo m-separated sequence can be produced in time O(n m 2^m) whenever such a sequence exists. Meanwhile, we show that any structure whose helices consist of 3 or more base pairs is either trivially non-designable, or always admits a modulo-2 separated solution. Solution sequences can thus be produced in linear time, and even be uniformly generated within the set of modulo-2 separable sequences.
Citation: Théo Boury, Samuel Gardelle, Laurent Bulteau, Yann Ponty. "RNA inverse folding can be solved in linear time for structures without isolated stacks or base pairs." Algorithms for Molecular Biology 20(1):20. Published 2025-10-24. DOI: 10.1186/s13015-025-00278-6. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12553252/pdf/
Pub Date: 2025-10-02 DOI: 10.1186/s13015-025-00285-7
Anna Lindeberg, Guillaume E Scholz, Nicolas Wieseke, Marc Hellmuth
Orthologous genes, which arise through speciation, play a key role in comparative genomics and functional inference. In particular, graph-based methods allow for the inference of orthology estimates without prior knowledge of the underlying gene or species trees. This results in orthology graphs, where each vertex represents a gene, and an edge exists between two vertices if the corresponding genes are estimated to be orthologs. Orthology graphs inferred under a tree-like evolutionary model must be cographs. However, real-world data often deviate from this property, either due to noise in the data, errors in inference methods or, simply, because evolution follows a network-like rather than a tree-like process. The latter, in particular, raises the question of whether and how orthology graphs can be derived from or, equivalently, are explained by phylogenetic networks. In this work, we study the constraints imposed on orthology graphs when the underlying evolutionary history follows a phylogenetic network instead of a tree. We show that any orthology graph can be represented by a sufficiently complex level-k network. However, such networks lack biologically meaningful constraints. In contrast, level-1 networks provide a simpler explanation, and we establish characterizations for level-1 explainable orthology graphs, i.e., those derived from level-1 evolutionary histories. To this end, we employ modular decomposition, a classical technique for studying graph structures. Specifically, an arbitrary graph is level-1 explainable if and only if each primitive subgraph is a near-cograph (a graph in which the removal of a single vertex results in a cograph). Additionally, we present a linear-time algorithm to recognize level-1 explainable orthology graphs and to construct a level-1 network that explains them, if such a network exists. 
Finally, we demonstrate the close relationship of level-1 explainable orthology graphs to the substitution operation, weakly chordal and perfect graphs, as well as graphs with twin-width at most 2.
Citation: Anna Lindeberg, Guillaume E Scholz, Nicolas Wieseke, Marc Hellmuth. "Orthology and near-cographs in the context of phylogenetic networks." Algorithms for Molecular Biology 20(1):19. Published 2025-10-02. DOI: 10.1186/s13015-025-00285-7. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12490074/pdf/
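The near-cograph condition in the abstract above can be illustrated with a brute-force check. The sketch uses the well-known characterization of cographs as graphs with no induced path on four vertices (P4) and then tries deleting each vertex in turn. It is a small-graph illustration only, far slower than the linear-time algorithm the abstract announces; the function names and the vertex/edge-list input format are assumptions made here.

```python
from itertools import combinations, permutations


def is_cograph(vertices, edges):
    """True if the graph induced on `vertices` has no induced P4."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        if u in adj and v in adj:  # ignore edges touching removed vertices
            adj[u].add(v)
            adj[v].add(u)
    for quad in combinations(vertices, 4):
        for a, b, c, d in permutations(quad):
            # induced path a-b-c-d: three path edges present, three chords absent
            if (b in adj[a] and c in adj[b] and d in adj[c]
                    and c not in adj[a] and d not in adj[a] and d not in adj[b]):
                return False
    return True


def is_near_cograph(vertices, edges):
    """True if deleting some single vertex leaves a cograph."""
    vertices = list(vertices)
    return any(
        is_cograph([u for u in vertices if u != v], edges) for v in vertices
    )


p4 = (["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")])
c5 = ([0, 1, 2, 3, 4], [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)])
print(is_cograph(*p4), is_near_cograph(*p4), is_near_cograph(*c5))
```

A path on four vertices is the smallest graph that is a near-cograph without being a cograph, while the 5-cycle fails because deleting any vertex leaves a P4.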
In this paper, we lay the groundwork for the comparison of phylogenetic networks based on edge contractions and expansions as edit operations, as originally proposed by Robinson and Foulds to compare trees. We prove that these operations connect the space of all phylogenetic networks on the same set of leaves, even if we forbid contractions that create cycles. This allows us to define an operational distance on this space, as the minimum number of contractions and expansions required to transform one network into another. We highlight the difference between this distance and the computation of the maximum common contraction between two networks. Because a maximum common contraction outlines a structure shared by the two networks, which can provide valuable biological insights, we study the algorithmic aspects of the latter. We first prove that computing a maximum common contraction between two networks is NP-hard, even when the maximum degree, the size of the common contraction, or the number of leaves is bounded. We also provide lower bounds for the problem based on the Exponential-Time Hypothesis. Nonetheless, we do provide a polynomial-time algorithm for weakly galled trees, a generalization of galled trees.
Citation: Bertrand Marchand, Nadia Tahiri, Shohreh Golpaigani Fard, Olivier Tremblay-Savard, Manuel Lafond. "Finding maximum common contractions between phylogenetic networks." Algorithms for Molecular Biology 20(1):18. Published 2025-10-01. DOI: 10.1186/s13015-025-00283-9. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12490124/pdf/
Pub Date: 2025-09-29 DOI: 10.1186/s13015-025-00290-w
Gleb Buzanov, Vsevolod Makeev
A biological study may yield only a small number of marker genes, too few to be used in gene set enrichment analysis. Here we propose VOL-Gene, a graph-based algorithm that partitions all genes into non-overlapping classes of functionally related genes, thus assigning a single function to each gene. To this end, many functional signatures are combined into a single weighted graph, which is partitioned into cliques. For a poorly annotated marker gene, our approach retrieves the genes that belong to the same class, some of which may be well annotated and are likely to take part in the same biological process.
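To make concrete what partitioning a weighted graph into non-overlapping cliques means, the following toy sketch may help. It is not the VOL-Gene algorithm, whose details the abstract does not give. The greedy strategy, the `threshold` parameter, the function name `greedy_clique_partition`, and the example gene labels and weights are all assumptions introduced for illustration.

```python
from typing import Dict, FrozenSet, List, Set


def greedy_clique_partition(
    weights: Dict[FrozenSet[str], float], threshold: float
) -> List[Set[str]]:
    """Greedily split the nodes into disjoint cliques (toy heuristic).

    An edge is kept only if its weight is at least `threshold`; every
    returned class is a clique in the thresholded graph.
    """
    nodes = sorted({v for edge in weights for v in edge})
    adjacent: Dict[str, Set[str]] = {v: set() for v in nodes}
    for edge, weight in weights.items():
        if weight >= threshold:
            u, v = tuple(edge)
            adjacent[u].add(v)
            adjacent[v].add(u)

    unassigned = set(nodes)
    classes: List[Set[str]] = []
    for seed in nodes:
        if seed not in unassigned:
            continue
        clique = {seed}
        for candidate in sorted(unassigned - {seed}):
            # Add the candidate only if it is adjacent to every clique member.
            if clique <= adjacent[candidate]:
                clique.add(candidate)
        unassigned -= clique
        classes.append(clique)
    return classes


# Hypothetical similarity weights between five genes.
example_weights = {
    frozenset({"geneA", "geneB"}): 0.9,
    frozenset({"geneB", "geneC"}): 0.8,
    frozenset({"geneA", "geneC"}): 0.7,
    frozenset({"geneC", "geneD"}): 0.9,
    frozenset({"geneD", "geneE"}): 0.2,
}
print(greedy_clique_partition(example_weights, threshold=0.5))
```

Because each gene ends up in exactly one class, a poorly annotated gene can be looked up by finding its class and reading off the better annotated members.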
Exclusive functional signatures for gene annotation with vast OpenOrd layout. Gleb Buzanov, Vsevolod Makeev. Algorithms for Molecular Biology, vol. 20, no. 1, article 17. Published 2025-09-29. DOI: 10.1186/s13015-025-00290-w. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12482410/pdf/
Pub Date: 2025-08-19 DOI: 10.1186/s13015-025-00284-8
Alitzel López Sánchez, José Antonio Ramírez-Rafael, Alejandro Flores-Lamas, Maribel Hernández-Rosales, Manuel Lafond
Background: In this study, we investigate the problem of comparing gene trees reconciled with the same species tree using a novel semi-metric, called the Path-Label Reconciliation (PLR) dissimilarity measure. This approach not only quantifies differences in the topology of reconciled gene trees, but also considers discrepancies in predicted ancestral gene-species maps and speciation/duplication events, offering a refinement of existing metrics such as Robinson-Foulds (RF) and its labeled extensions LRF and ELRF. A tunable parameter α also allows users to adjust the balance between its species map and event labeling components.
Our contributions: We show that PLR can be computed in linear time and that it is a semi-metric. We also discuss the diameters of reconciled gene tree measures, which are important in practice for normalization, and provide initial bounds on PLR, LRF, and ELRF. To validate PLR, we simulate reconciliations and perform comparisons with LRF and ELRF. The results show that PLR provides a more evenly distributed range of distances, making it less susceptible to overestimating differences in the presence of small topological changes, while at the same time being computationally efficient. We also apply our measure to evaluate the set of possible rootings of gene trees against a gold standard, and demonstrate that our measure is better at distinguishing one best gene tree among multiple candidates. Furthermore, our findings suggest that the theoretical diameter is rarely reached in practice. The PLR measure advances phylogenetic reconciliation by combining theoretical rigor with practical applicability. Future research will refine its mathematical properties, explore its performance on different types of trees, and integrate it with existing bioinformatics tools for large-scale evolutionary analyses. The implementation of the PLR distance is available in the open-source PyPI package parle: https://pypi.org/project/parle/ .
The path-label reconciliation (PLR) dissimilarity measure for gene trees. Alitzel López Sánchez, José Antonio Ramírez-Rafael, Alejandro Flores-Lamas, Maribel Hernández-Rosales, Manuel Lafond. Algorithms for Molecular Biology, vol. 20, no. 1, article 16. Published 2025-08-19. DOI: 10.1186/s13015-025-00284-8. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12366074/pdf/