Algorithms for Molecular Biology: Latest Articles

On the parameterized complexity of the median and closest problems under some permutation metrics.

IF 1.5 Zone 4 Biology Q4 BIOCHEMICAL RESEARCH METHODS

Algorithms for Molecular Biology

Pub Date : 2024-12-24 DOI: 10.1186/s13015-024-00269-z

Luís Cunha, Ignasi Sau, Uéverton Souza

Genome rearrangements are events in which large blocks of DNA exchange places during evolution. The analysis of these events is a promising tool for understanding evolutionary genomics, providing data for phylogenetic reconstruction based on genome rearrangement measures. Many pairwise rearrangement distances have been proposed, each defined as the minimum number of rearrangement events of some predefined type needed to transform one genome into the other. When more than two genomes are considered, we have the more challenging problem of rearrangement-based phylogeny reconstruction. Given a set of genomes and a distance notion, there are at least two natural ways to define the "target" genome: the median genome, which minimizes the sum of the distances to the input genomes, and the closest genome, which minimizes the maximum distance to any input genome. Considering genomes as permutations of distinct integers, some distance metrics have been extensively studied. We investigate the median and closest problems on permutations over the following metrics: breakpoint distance, swap distance, block-interchange distance, short-block-move distance, and transposition distance. In biological applications some values are usually very small, such as the solution value d or the number k of input permutations. For each of these metrics and parameters d or k, we analyze the closest and the median problems from the viewpoint of parameterized complexity.
We obtain the following results: NP-hardness of finding the median/closest permutation under some of the distance metrics, even for only $k = 3$ permutations; polynomial kernels for finding the median permutation under all studied metrics, with the target distance d as the parameter; NP-hardness of finding the closest permutation under the short-block-move distance; and FPT algorithms and the infeasibility of polynomial kernels for finding the closest permutation under some metrics when parameterized by the target distance d.
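To make the two objectives concrete, the following is a minimal brute-force sketch, not the authors' algorithm. It assumes a stand-in metric, the minimum number of two-element exchanges between permutations (n minus the number of cycles); whether this coincides with the paper's swap distance is an assumption. The function names `exchange_distance` and `median_and_closest` are hypothetical, and the search over all n! candidates is only feasible for very small n.

```python
from itertools import permutations


def exchange_distance(p, q):
    """Minimum number of two-element exchanges that turn p into q.

    Equals n minus the number of cycles of the index mapping p -> q.
    Used here as an illustrative metric (an assumption, see lead-in).
    """
    pos_in_q = {value: index for index, value in enumerate(q)}
    # mapping[i] is the index in q of the element stored at index i of p.
    mapping = [pos_in_q[value] for value in p]
    seen = [False] * len(p)
    cycles = 0
    for start in range(len(p)):
        if not seen[start]:
            cycles += 1
            j = start
            while not seen[j]:
                seen[j] = True
                j = mapping[j]
    return len(p) - cycles


def median_and_closest(perms):
    """Exhaustively search all permutations for a median and a closest one."""
    n = len(perms[0])
    candidates = list(permutations(range(1, n + 1)))
    # Median: minimize the SUM of distances to the inputs.
    median = min(candidates,
                 key=lambda c: sum(exchange_distance(c, p) for p in perms))
    # Closest: minimize the MAXIMUM distance to the inputs.
    closest = min(candidates,
                  key=lambda c: max(exchange_distance(c, p) for p in perms))
    return median, closest


if __name__ == "__main__":
    inputs = [(1, 2, 3), (2, 1, 3), (1, 3, 2)]
    print(median_and_closest(inputs))
```

The two objectives can select different permutations in general; the exhaustive search is exponential in n, which is why the parameters d and k discussed in the abstract matter.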

Citations: 0

TINNiK: inference of the tree of blobs of a species network under the coalescent model.

IF 1.5 Zone 4 Biology Q4 BIOCHEMICAL RESEARCH METHODS

Algorithms for Molecular Biology

Pub Date : 2024-11-05 DOI: 10.1186/s13015-024-00266-2

Elizabeth S Allman, Hector Baños, Jonathan D Mitchell, John A Rhodes

The tree of blobs of a species network shows only the tree-like aspects of relationships of taxa on a network, omitting information on network substructures where hybridization or other types of lateral transfer of genetic information occur. By isolating such regions of a network, inference of the tree of blobs can serve as a starting point for a more detailed investigation, or indicate the limit of what may be inferrable without additional assumptions. Building on our theoretical work on the identifiability of the tree of blobs from gene quartet distributions under the Network Multispecies Coalescent model, we develop an algorithm, TINNiK, for statistically consistent tree of blobs inference. We provide examples of its application to both simulated and empirical datasets, utilizing an implementation in the MSCquartets 2.0 R package.

Citations: 0

New generalized metric based on branch length distance to compare B cell lineage trees.

IF 1.5 Zone 4 Biology Q4 BIOCHEMICAL RESEARCH METHODS

Algorithms for Molecular Biology

Pub Date : 2024-10-05 DOI: 10.1186/s13015-024-00267-1

Mahsa Farnia, Nadia Tahiri

The B cell lineage tree encapsulates the successive phases of B cell differentiation and maturation, transitioning from hematopoietic stem cells to mature, antibody-secreting cells within the immune system. Mathematically, this lineage can be conceptualized as an evolutionary tree, where each node represents a distinct stage in B cell development, and the edges reflect the differentiation pathways. To compare these lineage trees, a rigorous mathematical metric is essential. Analyzing B cell lineage trees mathematically and quantifying changes in lineage attributes over time necessitates a comparison methodology capable of accurately assessing and measuring these changes. Addressing the intricacies of multiple B cell lineage tree comparisons, this study introduces a novel metric that enhances the precision of comparative analysis. This metric is formulated on principles of metric theory and evolutionary biology, quantifying the dissimilarities between lineage trees by measuring branch length distance and weight. By providing a framework for systematically classifying lineage trees, this metric facilitates the development of predictive models that are crucial for the creation of targeted immunotherapy and vaccines. To validate the effectiveness of this new metric, synthetic datasets that mimic the complexity and variability of real B cell lineage structures are employed. We demonstrated the ability of the new metric method to accurately capture the evolutionary nuances of B cell lineages.

Citations: 0

Metric multidimensional scaling for large single-cell datasets using neural networks.

IF 1 Zone 4 Biology Q4 BIOCHEMICAL RESEARCH METHODS

Algorithms for Molecular Biology

Pub Date : 2024-06-11 DOI: 10.1186/s13015-024-00265-3

Stefan Canzar, Van Hoan Do, Slobodan Jelić, Sören Laue, Domagoj Matijević, Tomislav Prusina

Metric multidimensional scaling is one of the classical methods for embedding data into low-dimensional Euclidean space. It creates the low-dimensional embedding by approximately preserving the pairwise distances between the input points. However, current state-of-the-art approaches only scale to a few thousand data points. For larger data sets such as those occurring in single-cell RNA sequencing experiments, the running time becomes prohibitively large and thus alternative methods such as PCA are widely used instead. Here, we propose a simple neural network-based approach for solving the metric multidimensional scaling problem that is orders of magnitude faster than previous state-of-the-art approaches, and hence scales to data sets with up to a few million cells. At the same time, it provides a non-linear mapping between high- and low-dimensional space that can place previously unseen cells in the same embedding.
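The objective that metric multidimensional scaling tries to minimize can be written down in a few lines. The sketch below is illustrative only: it computes the raw stress, the sum of squared differences between target distances and embedded distances, and does not reproduce the neural-network optimizer described in the abstract. The names `pairwise_distances` and `raw_stress` are hypothetical.

```python
import math


def pairwise_distances(points):
    """Full Euclidean distance matrix of a list of coordinate tuples."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)]
            for i in range(n)]


def raw_stress(target, embedding):
    """Sum over pairs i < j of (target[i][j] - ||y_i - y_j||)^2."""
    n = len(embedding)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            gap = target[i][j] - math.dist(embedding[i], embedding[j])
            total += gap * gap
    return total


if __name__ == "__main__":
    # Three points in 3-D that happen to lie in a plane.
    high = [(0, 0, 0), (3, 0, 0), (0, 4, 0)]
    target = pairwise_distances(high)
    perfect = [(0, 0), (3, 0), (0, 4)]      # preserves every distance
    collapsed = [(0, 0), (0, 0), (0, 0)]    # loses every distance
    print(raw_stress(target, perfect), raw_stress(target, collapsed))
```

Evaluating this objective directly touches all pairs of points, which gives a sense of why scaling to large data sets is difficult.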

Citations: 0

Compression algorithm for colored de Bruijn graphs.

IF 1 Zone 4 Biology Q4 BIOCHEMICAL RESEARCH METHODS

Algorithms for Molecular Biology

Pub Date : 2024-05-26 DOI: 10.1186/s13015-024-00254-6

Amatur Rahman, Yoann Dufresne, Paul Medvedev

A colored de Bruijn graph (also called a set of k-mer sets) is a set of k-mers with every k-mer assigned a set of colors. Colored de Bruijn graphs are used in a variety of applications, including variant calling, genome assembly, and database search. However, their size has posed a scalability challenge to algorithm developers and users. Numerous indexing data structures have been proposed that store the graph compactly while supporting fast query operations. However, disk compression algorithms, which do not need to support queries on the compressed data and can thus be more space-efficient, have received little attention. The dearth of specialized compression tools has been a detriment to tool developers, tool users, and reproducibility efforts. In this paper, we develop a new tool that compresses colored de Bruijn graphs to disk, building on previous ideas for compression of k-mer sets and indexing colored de Bruijn graphs. We test our tool, called ESS-color, on various datasets, including both sequencing data and whole genomes. ESS-color achieves better compression than all evaluated tools on all datasets, with no other tool able to consistently achieve less than 44% space overhead. The software is available at http://github.com/medvedevgroup/ESSColor.
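The definition in the first sentence can be illustrated with a small sketch that maps every k-mer to the set of samples (colors) containing it, then groups k-mers that share an identical color set. This is only the uncompressed object, under simplifying assumptions: reverse complements are ignored, and the names `colored_kmers` and `color_classes` are hypothetical. It says nothing about how ESS-color encodes the data.

```python
from collections import defaultdict


def colored_kmers(samples, k):
    """Map each k-mer to the set of sample names (colors) it occurs in.

    `samples` is a dict {color_name: sequence}. Reverse complements are
    not merged, which is a simplification.
    """
    colors = defaultdict(set)
    for color, seq in samples.items():
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(color)
    return dict(colors)


def color_classes(kmer_colors):
    """Group k-mers by their exact color set."""
    classes = defaultdict(list)
    for kmer, cols in sorted(kmer_colors.items()):
        classes[frozenset(cols)].append(kmer)
    return dict(classes)


if __name__ == "__main__":
    graph = colored_kmers({"A": "ACGT", "B": "CGTT"}, k=3)
    for kmer, cols in sorted(graph.items()):
        print(kmer, sorted(cols))
```

Grouping k-mers by shared color set hints at where redundancy lies: many k-mers tend to carry the same color set, so the set need not be stored once per k-mer.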

Citations: 0

ESKEMAP: exact sketch-based read mapping

IF 1 Zone 4 Biology Q4 BIOCHEMICAL RESEARCH METHODS

Algorithms for Molecular Biology

Pub Date : 2024-05-04 DOI: 10.1186/s13015-024-00261-7

Tizian Schulz, Paul Medvedev

Given a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a “similar sequence”. Traditionally, “similar sequence” was defined as having a high alignment score and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has not been a problem formulation to capture what problem an exact sketch-based mapping algorithm should solve. Moreover, there is no sketch-based method that can find all possible mapping positions for a read above a certain score threshold. In this paper, we formulate the problem of read mapping at the level of sequence sketches. We give an exact dynamic programming algorithm that finds all hits above a given similarity threshold. It runs in $\mathcal{O}(|t| + |p| + \ell^2)$ time and $\mathcal{O}(\ell \log \ell)$ space, where $|t|$ is the number of $k$-mers inside the sketch of the reference, $|p|$ is the number of $k$-mers inside the read’s sketch and $\ell$ is the number of times that $k$-mers from the pattern sketch occur in the sketch of the text. We evaluate our algorithm’s performance in mapping long reads to the T2T assembly of human chromosome Y, where ampliconic regions make it desirable to find all good mapping positions. For an equivalent level of precision as minimap2, the recall of our algorithm is 0.88, compared to only 0.76 of minimap2.
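The quantity ℓ in the running-time bound can be computed directly from two sketches. The sketch below is a minimal illustration, not part of ESKEMAP: it treats a sketch as a plain list of k-mers, and it assumes that ℓ counts the positions of the text sketch whose k-mer also appears in the pattern sketch. The function name `count_sketch_hits` is hypothetical.

```python
def count_sketch_hits(text_sketch, pattern_sketch):
    """Count positions in the text sketch holding a k-mer of the pattern sketch.

    Both arguments are lists of k-mers in sketch order. The counting
    convention (one hit per matching text position) is an assumption.
    """
    pattern_kmers = set(pattern_sketch)
    return sum(1 for kmer in text_sketch if kmer in pattern_kmers)


if __name__ == "__main__":
    text_sketch = ["ACG", "CGT", "ACG", "TTT"]
    pattern_sketch = ["ACG", "GGG"]
    hits = count_sketch_hits(text_sketch, pattern_sketch)
    # The quadratic term of the time bound grows with hits squared.
    print(hits, hits ** 2)
```

Since the quadratic term depends only on ℓ, a read whose sketch shares few k-mers with the reference sketch contributes little beyond the linear scan of both sketches.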

Citations: 0

NestedBD: Bayesian inference of phylogenetic trees from single-cell copy number profiles under a birth-death model

IF 1 Zone 4 Biology Q4 BIOCHEMICAL RESEARCH METHODS

Algorithms for Molecular Biology

Pub Date : 2024-04-29 DOI: 10.1186/s13015-024-00264-4

Yushu Liu, Mohammadamin Edrisi, Zhi Yan, Huw A Ogilvie, Luay Nakhleh

Copy number aberrations (CNAs) are ubiquitous in many types of cancer. Inferring CNAs from cancer genomic data could help shed light on the initiation, progression, and potential treatment of cancer. While such data have traditionally been available via “bulk sequencing,” the more recently introduced techniques for single-cell DNA sequencing (scDNAseq) provide the type of data that makes CNA inference possible at the single-cell resolution. We introduce a new birth-death evolutionary model of CNAs and a Bayesian method, NestedBD, for the inference of evolutionary trees (topologies and branch lengths with relative mutation rates) from single-cell data. We evaluated NestedBD’s performance using simulated data sets, benchmarking its accuracy against traditional phylogenetic tools as well as state-of-the-art methods. The results show that NestedBD infers more accurate topologies and branch lengths, and that the birth-death model can improve the accuracy of copy number estimation. When applied to biological data sets, NestedBD infers plausible evolutionary histories of two colorectal cancer samples. NestedBD is available at https://github.com/Androstane/NestedBD.

Citations: 0

Revisiting the complexity of and algorithms for the graph traversal edit distance and its variants

IF 1 Zone 4 Biology Q4 BIOCHEMICAL RESEARCH METHODS

Algorithms for Molecular Biology

Pub Date : 2024-04-29 DOI: 10.1186/s13015-024-00262-6

Yutong Qiu, Yihang Shen, Carl Kingsford

The graph traversal edit distance (GTED), introduced by Ebrahimpour Boroojeny et al. (2018), is an elegant distance measure defined as the minimum edit distance between strings reconstructed from Eulerian trails in two edge-labeled graphs. GTED can be used to infer evolutionary relationships between species by comparing de Bruijn graphs directly without the computationally costly and error-prone process of genome assembly. Ebrahimpour Boroojeny et al. (2018) propose two ILP formulations for GTED and claim that GTED is polynomially solvable because the linear programming relaxation of one of the ILPs always yields optimal integer solutions. The claim that GTED is polynomially solvable contradicts the complexity results of existing string-to-graph matching problems. We resolve this conflict in complexity results by proving that GTED is NP-complete and showing that the ILPs proposed by Ebrahimpour Boroojeny et al. do not solve GTED but instead solve for a lower bound of GTED and are not solvable in polynomial time. In addition, we provide the first two correct ILP formulations of GTED and evaluate their empirical efficiency. These results provide solid algorithmic foundations for comparing genome graphs and point to the direction of heuristics. The source code to reproduce experimental results is available at https://github.com/Kingsford-Group/gtednewilp/.
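The definition of GTED can be turned into a brute-force sketch for tiny graphs: enumerate every string spelled by a trail that uses each edge exactly once, then take the minimum edit distance over all pairs. This is an illustration of the definition only, not one of the ILP formulations from the paper. The edge representation `(source, target, label)`, the function names, and the choice to allow open as well as closed trails are assumptions.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance with unit costs, row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def eulerian_strings(edges):
    """All strings spelled by trails that use every edge exactly once."""
    results = set()
    used = [False] * len(edges)

    def extend(vertex, labels):
        if len(labels) == len(edges):
            results.add("".join(labels))
            return
        for idx, (u, v, label) in enumerate(edges):
            if not used[idx] and u == vertex:
                used[idx] = True
                labels.append(label)
                extend(v, labels)
                labels.pop()
                used[idx] = False

    for start in {u for u, _, _ in edges}:
        extend(start, [])
    return results


def gted_brute_force(edges1, edges2):
    """Minimum edit distance over all pairs of Eulerian-trail strings.

    Raises ValueError if either graph has no Eulerian trail.
    """
    return min(edit_distance(s, t)
               for s in eulerian_strings(edges1)
               for t in eulerian_strings(edges2))


if __name__ == "__main__":
    cycle = [(0, 1, "A"), (1, 2, "C"), (2, 0, "G")]
    path = [(0, 1, "A"), (1, 2, "C"), (2, 3, "T")]
    print(sorted(eulerian_strings(cycle)), gted_brute_force(cycle, path))
```

The enumeration is exponential in the number of edges, which is in keeping with the NP-completeness result stated in the abstract.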

Citations: 0

Fast, parallel, and cache-friendly suffix array construction

IF 1 Zone 4 Biology Q4 BIOCHEMICAL RESEARCH METHODS

Algorithms for Molecular Biology

Pub Date : 2024-04-28 DOI: 10.1186/s13015-024-00263-5

Jamshed Khan, Tobias Rubel, Erin Molloy, Laxman Dhulipala, Rob Patro

String indexes such as the suffix array (sa) and the closely related longest common prefix (lcp) array are fundamental objects in bioinformatics and have a wide variety of applications. Despite their importance in practice, few scalable parallel algorithms for constructing these are known, and the existing algorithms can be highly non-trivial to implement and parallelize. In this paper we present caps-sa, a simple and scalable parallel algorithm for constructing these string indexes inspired by samplesort and utilizing an LCP-informed mergesort. Due to its design, caps-sa has excellent memory-locality and thus incurs fewer cache misses and achieves strong performance on modern multicore systems with deep cache hierarchies. We show that despite its simple design, caps-sa outperforms existing state-of-the-art parallel sa and lcp-array construction algorithms on modern hardware. Finally, motivated by applications in modern aligners where the query strings have bounded lengths, we introduce the notion of a bounded-context sa and show that caps-sa can easily be extended to exploit this structure to obtain further speedups. We make our code publicly available at https://github.com/jamshed/CaPS-SA .
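For readers unfamiliar with the two indexes, here is a minimal sequential sketch: a naive suffix array built by sorting suffixes, and the LCP array computed with Kasai's algorithm. It is a reference illustration only and is unrelated to how caps-sa works internally (caps-sa is parallel and based on samplesort with an LCP-informed mergesort). The `bounded_context_suffix_array` function is one possible reading of the bounded-context notion, ordering suffixes by their first `b` characters only; that reading, and all function names, are assumptions.

```python
def suffix_array(text):
    """Naive suffix array: start positions sorted by suffix (slow, for clarity)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """Kasai's algorithm: lcp[r] is the longest common prefix length of the
    suffixes at sa[r - 1] and sa[r]; lcp[0] is 0."""
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


def bounded_context_suffix_array(text, b):
    """Order suffixes by their first b characters only (assumed reading).

    Python's sort is stable, so ties keep increasing start position.
    """
    return sorted(range(len(text)), key=lambda i: text[i:i + b])


if __name__ == "__main__":
    sa = suffix_array("banana")
    print(sa, lcp_array("banana", sa))
    print(bounded_context_suffix_array("banana", 1))
```

Bounding the compared prefix length limits the work per comparison, which is the intuition behind the further speedups mentioned at the end of the abstract.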


Citations: 0

Pfp-fm: an accelerated FM-index

IF 1, Zone 4 (Biology), Q4 BIOCHEMICAL RESEARCH METHODS

Algorithms for Molecular Biology

Pub Date : 2024-04-10 DOI: 10.1186/s13015-024-00260-8

Aaron Hong, Marco Oliva, Dominik Köppl, Hideo Bannai, Christina Boucher, Travis Gagie

FM-indexes are crucial data structures in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [1] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. In 2022, Deng et al. [2] proposed parsing genomic data by induced suffix sorting, and showed that the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing (which takes parameters that let us tune the average length of the phrases) instead of induced suffix sorting gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate that it is between 3 and 18 times faster than competing methods on queries to GRCh38, and is consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, our method appears to accelerate count queries relative to all state-of-the-art methods, at the cost of a moderate increase in memory. The source code for PFP-FM is available at https://github.com/AaronHong1024/afm .
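The "one random access per character" cost can be seen in a textbook character-based FM-index. The sketch below is not PFP-FM and does not implement prefix-free parsing; it is a minimal, uncompressed illustration of standard backward search, with hypothetical function names and the assumption that the sentinel `$` does not occur in the text and sorts before every other symbol.

```python
from collections import Counter


def build_fm_index(text: str):
    """Build the C table and occurrence table of the BWT of text + '$'."""
    text += "$"  # sentinel: assumed absent from text and smallest symbol
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to '$' at i = 0)
    bwt = "".join(text[i - 1] for i in sa)
    counts = Counter(text)
    # C[c] = number of characters in the text strictly smaller than c
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i] = number of occurrences of c in bwt[:i] (stored uncompressed)
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ, len(bwt)


def count(pattern: str, C, occ, n: int) -> int:
    """Backward search: number of occurrences of pattern in the text."""
    lo, hi = 0, n  # half-open range of suffix-array rows
    for c in reversed(pattern):
        if c not in C:
            return 0
        # Each step reads occ at two data-dependent positions; these are
        # the per-character random accesses.
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    C, occ, n = build_fm_index("banana")
    print(count("ana", C, occ, n))  # 2
```

The loop in `count` runs once per pattern character. A word-based index would instead take one step per phrase, which is why the choice of parsing determines how many steps a query of a given length needs.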


Citations: 0