Pub Date: 2025-04-05 DOI: 10.1186/s13015-025-00275-9
Karl K Käther, Andreas Remmel, Steffen Lemke, Peter F Stadler
Orthology inference lies at the foundation of comparative genomics research. The correct identification of loci that descended from a common ancestral sequence is complicated not only by sequence divergence but also by duplication and other genome rearrangements. The conservation of gene order, i.e. synteny, is used in conjunction with sequence similarity as an additional factor for orthology determination. Current approaches, however, rely on genome annotations and are therefore limited. Here we present an annotation-free approach and compare it to synteny analysis with annotations. We find that our approach works better for closely related genomes, whereas annotation-based analysis performs better for more distantly related genomes. Overall, the presented algorithm offers a useful alternative to annotation-based methods and can outperform them in many cases.
Title: Unbiased anchors for reliable genome-wide synteny detection. Journal: Algorithms for Molecular Biology 20(1):5 (2025). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11972476/pdf/
Pub Date: 2025-03-17 DOI: 10.1186/s13015-025-00270-0
Ragnar Groot Koerkamp, Daniel Liu, Giulio Ermanno Pibiri
Sampling algorithms that deterministically select a subset of k-mers are an important building block in bioinformatics applications. For example, they are used to index large textual collections, like DNA, and to compare sequences quickly. In such applications, a sampling algorithm is required to select one k-mer out of every window of w consecutive k-mers. The folklore and most used scheme is the random minimizer that selects the smallest k-mer in the window according to some random order. This scheme is remarkably simple and versatile, and has a density (expected fraction of selected k-mers) of 2/(w+1). In practice, lower density leads to faster methods and smaller indexes, and it turns out that the random minimizer is not the best one can do. Indeed, some schemes are known to approach optimal density 1/w when k → ∞, like the recently introduced mod-minimizer (Groot Koerkamp and Pibiri, WABI 2024). In this work, we study methods that achieve low density when k ≤ w. In this small-k regime, a practical method with provably better density than the random minimizer is the miniception (Zheng et al., Bioinformatics 2021). This method can be elegantly described as sampling the smallest closed syncmer (Edgar, PeerJ 2021) in the window according to some random order. We show that extending the miniception to prefer sampling open syncmers yields much better density. This new method, the open-closed minimizer, offers improved density for small k ≤ w while being as fast to compute as the random minimizer. Compared to methods based on decycling sets, which achieve very low density in the small-k regime, our method has comparable density while being computationally simpler and more intuitive. Furthermore, we extend the mod-minimizer to improve density of any scheme that works well for small k to also work well when k > w is large. We hence obtain the open-closed mod-minimizer, a practical method that improves over the mod-minimizer for all k.
Title: The open-closed mod-minimizer algorithm. Journal: Algorithms for Molecular Biology 20(1):4 (2025). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11912762/pdf/
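As a minimal sketch of the random minimizer described in the abstract above (not the open-closed scheme the paper introduces), the following Python assigns each distinct k-mer a pseudo-random rank and keeps the leftmost smallest k-mer of every window of w consecutive k-mers. The function names, the `seed` parameter, and the leftmost tie-breaking rule are illustrative assumptions, not details taken from the paper.

```python
import random


def random_minimizer_positions(seq, k, w, seed=0):
    """Start positions of the k-mers sampled by a random minimizer.

    Assumption: the "random order" is modelled by giving every distinct
    k-mer an independent pseudo-random rank; ties are broken leftmost.
    """
    rng = random.Random(seed)
    rank_of = {}  # distinct k-mer -> random rank, filled lazily

    n_kmers = len(seq) - k + 1
    ranks = []
    for i in range(n_kmers):
        kmer = seq[i:i + k]
        if kmer not in rank_of:
            rank_of[kmer] = rng.random()
        ranks.append(rank_of[kmer])

    selected = set()
    # Slide a window over every run of w consecutive k-mers.
    for start in range(n_kmers - w + 1):
        best = min(range(start, start + w), key=lambda i: (ranks[i], i))
        selected.add(best)
    return sorted(selected)


def density(seq, k, w, seed=0):
    """Fraction of k-mers that the scheme selects."""
    n_kmers = len(seq) - k + 1
    return len(random_minimizer_positions(seq, k, w, seed)) / n_kmers
```

On a long random DNA string with k large enough that repeated k-mers are rare, the measured density should land close to 2/(w+1); for small k, repeated k-mers can make it deviate.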
Pub Date: 2025-03-01 DOI: 10.1186/s13015-025-00272-y
Stephen Hwang, Nathaniel K Brown, Omar Y Ahmed, Katharine M Jenike, Sam Kovaka, Michael C Schatz, Ben Langmead
Pangenomes are growing in number and size, thanks to the prevalence of high-quality long-read assemblies. However, current methods for studying sequence composition and conservation within pangenomes have limitations. Methods based on graph pangenomes require a computationally expensive multiple-alignment step, which can leave out some variation. Indexes based on k-mers and de Bruijn graphs are limited to answering questions at a specific substring length k. We present Maximal Exact Match Ordered (MEMO), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows. MEMO supports both queries that test k-mer presence/absence (membership queries) and queries that count the number of genomes containing each k-mer in a window (conservation queries). MEMO's index for a pangenome of 89 human autosomal haplotypes fits in 2.04 GB, 8.8× smaller than a comparable KMC3 index and 11.4× smaller than a PanKmer index. MEMO indexes can be made smaller by sacrificing some counting resolution, with our decile-resolution HPRC index reaching 0.67 GB. MEMO can conduct a conservation query for 31-mers over the human leukocyte antigen locus in 13.89 s, 2.5× faster than other approaches. MEMO's small index size, lack of k-mer length dependence, and efficient queries make it a flexible tool for studying and visualizing substring conservation in pangenomes.
Title: MEM-based pangenome indexing for k-mer queries. Journal: Algorithms for Molecular Biology 20(1):3 (2025). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11871630/pdf/
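To make the two query types in the abstract above concrete, here is a brute-force Python sketch of what a membership query and a conservation query return. It only spells out the definitions by direct substring search; it is not MEMO's MEM-based index, and the function names are hypothetical. Reverse complements are ignored for simplicity, which is an assumption of this sketch.

```python
def membership_query(window, genome, k):
    """For each k-mer of `window`, report whether it occurs in `genome`."""
    return [window[i:i + k] in genome for i in range(len(window) - k + 1)]


def conservation_query(window, genomes, k):
    """For each k-mer of `window`, count the genomes that contain it."""
    counts = []
    for i in range(len(window) - k + 1):
        kmer = window[i:i + k]
        # Naive substring search per genome; a real index avoids this scan.
        counts.append(sum(kmer in genome for genome in genomes))
    return counts


if __name__ == "__main__":
    genomes = ["ACGTT", "CGTA", "GGGG"]
    print(conservation_query("ACGT", genomes, 3))  # prints [1, 2]
```

Because k is just an argument here, the same data answers queries for any k, which is the property the abstract highlights; the cost of this naive version grows with the total pangenome size per query.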
Pub Date: 2025-02-28 DOI: 10.1186/s13015-025-00273-x
Chris Jennings-Shaffer, David H Rich, Matthew Macaulay, Michael D Karcher, Tanvi Ganapathy, Shosuke Kiami, Anna Kooperberg, Cheng Zhang, Marc A Suchard, Frederick A Matsen
Bayesian phylogenetics typically estimates a posterior distribution, or aspects thereof, using Markov chain Monte Carlo methods. These methods integrate over tree space by applying local rearrangements to move a tree through its space as a random walk. Previous work explored the possibility of replacing this random walk with a systematic search, but was quickly overwhelmed by the large number of probable trees in the posterior distribution. In this paper we develop methods to sidestep this problem using a recently introduced structure called the subsplit directed acyclic graph (sDAG). This structure can represent many trees at once, and local rearrangements of trees translate to methods of enlarging the sDAG. Here we propose two methods of introducing, ranking, and selecting local rearrangements on sDAGs to produce a collection of trees with high posterior density. One of these methods successfully recovers the set of high posterior density trees across a range of data sets. However, we find that a simpler strategy of aggregating trees into an sDAG in fact is computationally faster and returns a higher fraction of probable trees.
Title: Finding high posterior density phylogenies by systematically extending a directed acyclic graph. Journal: Algorithms for Molecular Biology 20(1):2 (2025). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11869616/pdf/
Pub Date: 2025-02-08 DOI: 10.1186/s13015-024-00268-0
Timothé Rouzé, Igor Martayan, Camille Marchet, Antoine Limasset
The exponential increase in publicly available sequencing data and genomic resources necessitates the development of highly efficient methods for data processing and analysis. Locality-sensitive hashing techniques have successfully transformed large datasets into smaller, more manageable sketches while maintaining comparability using metrics such as Jaccard and containment indices. However, fixed-size sketches encounter difficulties when applied to divergent datasets. Scalable sketching methods, such as sourmash, provide valuable solutions but still lack resource-efficient, tailored indexing. Our objective is to create lighter sketches with comparable results while enhancing efficiency. We introduce the concept of Fractional Hitting Sets, a generalization of Universal Hitting Sets, which cover a specified fraction of the k-mer space. In theory and practice, we demonstrate the feasibility of achieving such coverage with simple but highly efficient schemes. By encoding the covered k-mers as super-k-mers, we provide a space-efficient exact representation that also enables optimized comparisons. Our novel tool, supersampler, implements this scheme, and experimental results with real bacterial collections closely match our theoretical findings. In comparison to sourmash, supersampler achieves similar outcomes while utilizing an order of magnitude less space and memory and operating several times faster. This highlights the potential of our approach in addressing the challenges presented by the ever-expanding landscape of genomic data. supersampler is an open-source software and can be accessed at https://github.com/TimRouze/supersampler . The data required to reproduce the results presented in this manuscript is available at https://github.com/TimRouze/supersampler/experiments .
Title: Fractional hitting sets for efficient multiset sketching. Journal: Algorithms for Molecular Biology 20(1):1 (2025). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11807336/pdf/
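The idea of a sketch that keeps a fixed fraction of the k-mer space, and is then compared with Jaccard and containment indices, can be illustrated with a short Python sketch. This is a generic hash-threshold subsampling in the spirit of scalable sketches, not supersampler's super-k-mer scheme; the function names, the choice of BLAKE2b as hash, and the `fraction` parameter are assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Set of distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (assumed hash choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def fractional_sketch(seq, k, fraction):
    """Keep the k-mers whose hash falls in the lowest `fraction` of hash space."""
    threshold = int(fraction * 2**64)
    return {x for x in kmers(seq, k) if hash64(x) < threshold}


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, taken as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def containment(a, b):
    """|A ∩ B| / |A|: the share of A that is also found in B."""
    return len(a & b) / len(a) if a else 0.0
```

Since the kept k-mers depend only on their hash, two sketches built with the same `fraction` select the same part of k-mer space, so their Jaccard and containment values estimate those of the full k-mer sets.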
Pub Date: 2024-12-24 DOI: 10.1186/s13015-024-00269-z
Luís Cunha, Ignasi Sau, Uéverton Souza
Genome rearrangements are events where large blocks of DNA exchange places during evolution. The analysis of these events is a promising tool for understanding evolutionary genomics, providing data for phylogenetic reconstruction based on genome rearrangement measures. Many pairwise rearrangement distances have been proposed, based on finding the minimum number of rearrangement events to transform one genome into the other, using some predefined operation. When more than two genomes are considered, we have the more challenging problem of rearrangement-based phylogeny reconstruction. Given a set of genomes and a distance notion, there are at least two natural ways to define the "target" genome. One is a genome that minimizes the sum of the distances to all input genomes, called the median genome. The other is a genome that minimizes the maximum distance to any input genome, called the closest genome. Considering genomes as permutations of distinct integers, some distance metrics have been extensively studied. We investigate the median and closest problems on permutations over the following metrics: breakpoint distance, swap distance, block-interchange distance, short-block-move distance, and transposition distance. In biological applications some values are usually very small, such as the solution value d or the number k of input permutations. For each of these metrics and parameters d or k, we analyze the closest and the median problems from the viewpoint of parameterized complexity. We obtain the following results: NP-hardness of finding the median/closest permutation under some distance metrics, even for only k = 3 permutations; polynomial kernels for the problems of finding the median permutation under all studied metrics, considering the target distance d as parameter; NP-hardness of finding the closest permutation by short-block-moves; FPT algorithms and infeasibility of polynomial kernels for finding the closest permutation for some metrics when parameterized by the target distance d.
Title: On the parameterized complexity of the median and closest problems under some permutation metrics. Journal: Algorithms for Molecular Biology 19(1):24 (2024). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11669244/pdf/
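The difference between the median and the closest permutation can be seen on a tiny instance. The Python sketch below assumes that a "swap" exchanges any two elements, so that the swap distance equals the number of elements minus the number of cycles of the relative permutation; that reading of the metric, the function names, and the exhaustive search are assumptions for illustration only. The brute force is exponential in the permutation length, in line with the hardness results listed above, and is usable only for very small inputs.

```python
from itertools import permutations


def swap_distance(p, q):
    """Minimum number of swaps of two arbitrary elements turning p into q.

    Assumption: swaps need not be adjacent, so the distance is
    n minus the number of cycles of p relabelled by positions in q.
    """
    position_in_q = {value: i for i, value in enumerate(q)}
    relabelled = [position_in_q[value] for value in p]
    seen = [False] * len(relabelled)
    cycles = 0
    for start in range(len(relabelled)):
        if not seen[start]:
            cycles += 1
            j = start
            while not seen[j]:
                seen[j] = True
                j = relabelled[j]
    return len(relabelled) - cycles


def median_and_closest(perms):
    """Exhaustively find a median and a closest permutation."""
    candidates = list(permutations(sorted(perms[0])))
    # Median: minimise the sum of distances to the inputs.
    median = min(candidates,
                 key=lambda c: sum(swap_distance(c, p) for p in perms))
    # Closest: minimise the maximum distance to any input.
    closest = min(candidates,
                  key=lambda c: max(swap_distance(c, p) for p in perms))
    return median, closest
```

For the inputs (1, 2, 3), (1, 2, 3), (2, 1, 3), the median is (1, 2, 3) with total distance 1, while any closest permutation has maximum distance 1 to the inputs.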
Pub Date: 2024-11-05 DOI: 10.1186/s13015-024-00266-2
Elizabeth S Allman, Hector Baños, Jonathan D Mitchell, John A Rhodes
The tree of blobs of a species network shows only the tree-like aspects of relationships of taxa on a network, omitting information on network substructures where hybridization or other types of lateral transfer of genetic information occur. By isolating such regions of a network, inference of the tree of blobs can serve as a starting point for a more detailed investigation, or indicate the limit of what may be inferrable without additional assumptions. Building on our theoretical work on the identifiability of the tree of blobs from gene quartet distributions under the Network Multispecies Coalescent model, we develop an algorithm, TINNiK, for statistically consistent tree of blobs inference. We provide examples of its application to both simulated and empirical datasets, utilizing an implementation in the MSCquartets 2.0 R package.
Title: TINNiK: inference of the tree of blobs of a species network under the coalescent model. Journal: Algorithms for Molecular Biology 19(1):23 (2024). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11539473/pdf/
Pub Date: 2024-10-05 DOI: 10.1186/s13015-024-00267-1
Mahsa Farnia, Nadia Tahiri
The B cell lineage tree encapsulates the successive phases of B cell differentiation and maturation, transitioning from hematopoietic stem cells to mature, antibody-secreting cells within the immune system. Mathematically, this lineage can be conceptualized as an evolutionary tree, where each node represents a distinct stage in B cell development, and the edges reflect the differentiation pathways. To compare these lineage trees, a rigorous mathematical metric is essential. Analyzing B cell lineage trees mathematically and quantifying changes in lineage attributes over time necessitates a comparison methodology capable of accurately assessing and measuring these changes. Addressing the intricacies of multiple B cell lineage tree comparisons, this study introduces a novel metric that enhances the precision of comparative analysis. This metric is formulated on principles of metric theory and evolutionary biology, quantifying the dissimilarities between lineage trees by measuring branch length distance and weight. By providing a framework for systematically classifying lineage trees, this metric facilitates the development of predictive models that are crucial for the creation of targeted immunotherapy and vaccines. To validate the effectiveness of this new metric, synthetic datasets that mimic the complexity and variability of real B cell lineage structures are employed. We demonstrated the ability of the new metric method to accurately capture the evolutionary nuances of B cell lineages.
Title: New generalized metric based on branch length distance to compare B cell lineage trees. Journal: Algorithms for Molecular Biology 19(1):22 (2024). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11453055/pdf/
Pub Date : 2024-06-11 DOI: 10.1186/s13015-024-00265-3
Stefan Canzar, Van Hoan Do, Slobodan Jelić, Sören Laue, Domagoj Matijević, Tomislav Prusina
Metric multidimensional scaling is one of the classical methods for embedding data into low-dimensional Euclidean space. It creates the low-dimensional embedding by approximately preserving the pairwise distances between the input points. However, current state-of-the-art approaches only scale to a few thousand data points. For larger data sets such as those occurring in single-cell RNA sequencing experiments, the running time becomes prohibitively large and thus alternative methods such as PCA are widely used instead. Here, we propose a simple neural network-based approach for solving the metric multidimensional scaling problem that is orders of magnitude faster than previous state-of-the-art approaches, and hence scales to data sets with up to a few million cells. At the same time, it provides a non-linear mapping between high- and low-dimensional space that can place previously unseen cells in the same embedding.
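As a rough illustration of the objective behind metric multidimensional scaling, the sketch below measures how well an embedding preserves pairwise distances (the raw stress) and reduces it with plain gradient descent on the coordinates. This is not the neural network approach proposed in the paper. The function names, the step size, and the toy input are assumptions chosen for illustration only.

```python
import math
import random


def stress(target, embedding):
    """Raw stress: sum over pairs i < j of (target[i][j] - embedded distance)^2."""
    n = len(embedding)
    return sum(
        (target[i][j] - math.dist(embedding[i], embedding[j])) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


def mds_gradient_descent(target, init, steps=500, lr=0.01):
    """Plain gradient descent on the raw stress, starting from `init`.

    Hypothetical helper for illustration; it needs O(n^2) work per step,
    so it is only usable for small inputs.
    """
    x = [list(p) for p in init]
    n, dim = len(x), len(x[0])
    for _ in range(steps):
        grad = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(x[i], x[j])
                if d == 0.0:
                    continue  # gradient is undefined for coincident points; skip the pair
                # d/dx_i of (target - d)^2 is 2 * (d - target) * (x_i - x_j) / d
                coeff = 2.0 * (d - target[i][j]) / d
                for k in range(dim):
                    grad[i][k] += coeff * (x[i][k] - x[j][k])
        for i in range(n):
            for k in range(dim):
                x[i][k] -= lr * grad[i][k]
    return x


# Toy input (assumed): the four corners of a unit square and their exact distances.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
target = [[math.dist(p, q) for q in square] for p in square]

rng = random.Random(0)
init = [[rng.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(len(square))]
embedding = mds_gradient_descent(target, init)
```

Every step of this sketch touches all pairs of points, which hints at why pairwise-distance methods become expensive on large data sets such as those mentioned in the abstract.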
{"title":"Metric multidimensional scaling for large single-cell datasets using neural networks.","authors":"Stefan Canzar, Van Hoan Do, Slobodan Jelić, Sören Laue, Domagoj Matijević, Tomislav Prusina","doi":"10.1186/s13015-024-00265-3","DOIUrl":"10.1186/s13015-024-00265-3","url":null,"abstract":"<p><p>Metric multidimensional scaling is one of the classical methods for embedding data into low-dimensional Euclidean space. It creates the low-dimensional embedding by approximately preserving the pairwise distances between the input points. However, current state-of-the-art approaches only scale to a few thousand data points. For larger data sets such as those occurring in single-cell RNA sequencing experiments, the running time becomes prohibitively large and thus alternative methods such as PCA are widely used instead. Here, we propose a simple neural network-based approach for solving the metric multidimensional scaling problem that is orders of magnitude faster than previous state-of-the-art approaches, and hence scales to data sets with up to a few million cells. At the same time, it provides a non-linear mapping between high- and low-dimensional space that can place previously unseen cells in the same embedding.</p>","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"19 1","pages":"21"},"PeriodicalIF":1.0,"publicationDate":"2024-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11165904/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141307329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-26 DOI: 10.1186/s13015-024-00254-6
Amatur Rahman, Yoann Dufresne, Paul Medvedev
A colored de Bruijn graph (also called a set of k-mer sets) is a set of k-mers in which every k-mer is assigned a set of colors. Colored de Bruijn graphs are used in a variety of applications, including variant calling, genome assembly, and database search. However, their size has posed a scalability challenge to algorithm developers and users. Numerous indexing data structures have been proposed that store the graph compactly while supporting fast query operations. However, disk compression algorithms, which do not need to support queries on the compressed data and can thus be more space-efficient, have received little attention. The dearth of specialized compression tools has been a detriment to tool developers, tool users, and reproducibility efforts. In this paper, we develop a new tool that compresses colored de Bruijn graphs to disk, building on previous ideas for compression of k-mer sets and indexing colored de Bruijn graphs. We test our tool, called ESS-color, on various datasets, including both sequencing data and whole genomes. ESS-color achieves better compression than all evaluated tools on all datasets, with no other tool able to consistently achieve less than 44% space overhead. The software is available at http://github.com/medvedevgroup/ESSColor.
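To make the definition concrete, the following minimal sketch builds the "set of k-mer sets" view of a colored de Bruijn graph as a mapping from each k-mer to the set of samples (colors) that contain it. It is not the ESS-color format or algorithm. The function names and toy sequences are assumptions, and reverse complements are ignored for simplicity.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Yield every length-k substring of `sequence`, left to right."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def build_colored_kmer_index(samples, k):
    """Map each k-mer to the set of sample names (colors) in which it occurs.

    `samples` is a dict of sample name -> sequence (hypothetical input format).
    """
    index = defaultdict(set)
    for color, sequence in samples.items():
        for kmer in kmers(sequence, k):
            index[kmer].add(color)
    return dict(index)


def color_classes(index):
    """Group k-mers that share exactly the same color set."""
    classes = defaultdict(list)
    for kmer, colors in index.items():
        classes[frozenset(colors)].append(kmer)
    return {colors: sorted(members) for colors, members in classes.items()}


# Toy input (assumed): two short sequences that overlap in two 3-mers.
samples = {"s1": "ACGTA", "s2": "CGTAC"}
index = build_colored_kmer_index(samples, k=3)
classes = color_classes(index)
```

In this toy input the k-mers `CGT` and `GTA` carry both colors, while `ACG` and `TAC` carry one each. Storing one color set per k-mer, as this dictionary does, repeats identical color sets many times, which gives a feel for why the size of such graphs is a concern.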
{"title":"Compression algorithm for colored de Bruijn graphs.","authors":"Amatur Rahman, Yoann Dufresne, Paul Medvedev","doi":"10.1186/s13015-024-00254-6","DOIUrl":"10.1186/s13015-024-00254-6","url":null,"abstract":"<p><p>A colored de Bruijn graph (also called a set of k-mer sets), is a set of k-mers with every k-mer assigned a set of colors. Colored de Bruijn graphs are used in a variety of applications, including variant calling, genome assembly, and database search. However, their size has posed a scalability challenge to algorithm developers and users. There have been numerous indexing data structures proposed that allow to store the graph compactly while supporting fast query operations. However, disk compression algorithms, which do not need to support queries on the compressed data and can thus be more space-efficient, have received little attention. The dearth of specialized compression tools has been a detriment to tool developers, tool users, and reproducibility efforts. In this paper, we develop a new tool that compresses colored de Bruijn graphs to disk, building on previous ideas for compression of k-mer sets and indexing colored de Bruijn graphs. We test our tool, called ESS-color, on various datasets, including both sequencing data and whole genomes. ESS-color achieves better compression than all evaluated tools and all datasets, with no other tool able to consistently achieve less than 44% space overhead. The software is available at http://github.com/medvedevgroup/ESSColor .</p>","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"19 1","pages":"20"},"PeriodicalIF":1.0,"publicationDate":"2024-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11129398/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141155161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}