
Latest articles from Algorithms for Molecular Biology

ESKEMAP: exact sketch-based read mapping
IF 1 | CAS Zone 4 (Biology) | Q4 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-05-04 | DOI: 10.1186/s13015-024-00261-7
Tizian Schulz, Paul Medvedev
Given a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a “similar sequence”. Traditionally, “similar sequence” was defined as having a high alignment score, and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has been no problem formulation capturing what problem an exact sketch-based mapping algorithm should solve. Moreover, there is no sketch-based method that can find all possible mapping positions for a read above a certain score threshold. In this paper, we formulate the problem of read mapping at the level of sequence sketches. We give an exact dynamic programming algorithm that finds all hits above a given similarity threshold. It runs in $$\mathcal{O}(|t| + |p| + \ell^2)$$ time and $$\mathcal{O}(\ell \log \ell)$$ space, where |t| is the number of $$k$$-mers inside the sketch of the reference, |p| is the number of $$k$$-mers inside the read’s sketch, and $$\ell$$ is the number of times that $$k$$-mers from the pattern sketch occur in the sketch of the text. We evaluate our algorithm’s performance in mapping long reads to the T2T assembly of human chromosome Y, where ampliconic regions make it desirable to find all good mapping positions. For an equivalent level of precision as minimap2, the recall of our algorithm is 0.88, compared to only 0.76 for minimap2.
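The sketch-level formulation compares sets of k-mers rather than raw sequences. As a minimal illustration of the idea (not ESKEMAP's actual algorithm or sketch definition), the following Python sketch builds FracMinHash-style sketches — keeping only k-mers whose hash falls in the lowest fraction of the hash range — and counts the k-mers shared between a read's sketch and a reference's sketch. The hash function and the `frac` parameter are arbitrary choices for this example:

```python
import hashlib

def kmers(seq, k):
    """Yield all k-mers of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def sketch(seq, k, frac=0.5):
    """FracMinHash-style sketch: keep the k-mers whose 32-bit hash falls
    in the lowest `frac` fraction of the hash range."""
    limit = int(frac * 2**32)
    out = set()
    for km in kmers(seq, k):
        h = int.from_bytes(hashlib.blake2b(km.encode(), digest_size=4).digest(), "big")
        if h < limit:
            out.add(km)
    return out

def sketch_similarity(read, ref, k=4, frac=0.5):
    """Number of sketch k-mers shared between a read and a reference."""
    return len(sketch(read, k, frac) & sketch(ref, k, frac))

ref  = "ACGTACGTGGTACCATGCA"
read = "ACGTACGTGGTA"        # a prefix of the reference
print(sketch_similarity(read, ref))
```

With `frac=1.0` the sketch degenerates to the full set of distinct k-mers, so a read that occurs verbatim in the reference shares all of its sketch k-mers with it.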
Citations: 0
NestedBD: Bayesian inference of phylogenetic trees from single-cell copy number profiles under a birth-death model
IF 1 | CAS Zone 4 (Biology) | Q4 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-04-29 | DOI: 10.1186/s13015-024-00264-4
Yushu Liu, Mohammadamin Edrisi, Zhi Yan, Huw A Ogilvie, Luay Nakhleh
Copy number aberrations (CNAs) are ubiquitous in many types of cancer. Inferring CNAs from cancer genomic data could help shed light on the initiation, progression, and potential treatment of cancer. While such data have traditionally been available via “bulk sequencing,” the more recently introduced techniques for single-cell DNA sequencing (scDNAseq) provide the type of data that makes CNA inference possible at single-cell resolution. We introduce a new birth-death evolutionary model of CNAs and a Bayesian method, NestedBD, for the inference of evolutionary trees (topologies and branch lengths with relative mutation rates) from single-cell data. We evaluated NestedBD’s performance using simulated data sets, benchmarking its accuracy against traditional phylogenetic tools as well as state-of-the-art methods. The results show that NestedBD infers more accurate topologies and branch lengths, and that the birth-death model can improve the accuracy of copy number estimation. When applied to biological data sets, NestedBD infers plausible evolutionary histories of two colorectal cancer samples. NestedBD is available at https://github.com/Androstane/NestedBD .
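To make the birth-death idea concrete, here is a hedged toy simulation (not NestedBD's model or code): each genomic bin's copy number gains or loses single copies at per-copy rates along a branch, with copy number 0 absorbing since a fully deleted segment cannot be regained. The rates, bin layout, and diploid baseline are illustrative assumptions:

```python
import random

def evolve_profile(profile, branch_length, birth_rate=0.5, death_rate=0.5, rng=random):
    """Evolve a copy-number profile along one branch under a toy birth-death
    model: each existing copy gains (+1) or loses (-1) a copy at the given
    per-copy rates; copy number 0 is absorbing."""
    out = []
    for cn in profile:
        t = 0.0
        while cn > 0:
            total_rate = (birth_rate + death_rate) * cn
            t += rng.expovariate(total_rate)   # waiting time to the next event
            if t > branch_length:
                break
            if rng.random() < birth_rate / (birth_rate + death_rate):
                cn += 1                        # copy-number gain
            else:
                cn -= 1                        # copy-number loss
        out.append(cn)
    return out

root = [2, 2, 2, 2]   # diploid baseline over four genomic bins
child = evolve_profile(root, branch_length=1.0, rng=random.Random(42))
print(child)
```

Simulations of exactly this flavor are what benchmarking a birth-death-aware inference method requires: profiles evolved down a known tree, then handed to the method to see how well the tree is recovered.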
Citations: 0
Revisiting the complexity of and algorithms for the graph traversal edit distance and its variants
IF 1 | CAS Zone 4 (Biology) | Q4 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-04-29 | DOI: 10.1186/s13015-024-00262-6
Yutong Qiu, Yihang Shen, Carl Kingsford
The graph traversal edit distance (GTED), introduced by Ebrahimpour Boroojeny et al. (2018), is an elegant distance measure defined as the minimum edit distance between strings reconstructed from Eulerian trails in two edge-labeled graphs. GTED can be used to infer evolutionary relationships between species by comparing de Bruijn graphs directly, without the computationally costly and error-prone process of genome assembly. Ebrahimpour Boroojeny et al. (2018) propose two ILP formulations for GTED and claim that GTED is polynomially solvable because the linear programming relaxation of one of the ILPs always yields optimal integer solutions. The claim that GTED is polynomially solvable contradicts the complexity results of existing string-to-graph matching problems. We resolve this conflict in complexity results by proving that GTED is NP-complete and showing that the ILPs proposed by Ebrahimpour Boroojeny et al. do not solve GTED but instead solve for a lower bound of GTED, and are not solvable in polynomial time. In addition, we provide the first two correct ILP formulations of GTED and evaluate their empirical efficiency. These results provide solid algorithmic foundations for comparing genome graphs and point to directions for heuristics. The source code to reproduce experimental results is available at https://github.com/Kingsford-Group/gtednewilp/ .
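Since GTED is NP-complete, exact computation is only feasible on small instances. The following brute-force sketch (our own illustration, not the paper's ILP formulations) makes the definition concrete: enumerate every string spelled by a complete Eulerian trail in each of two tiny edge-labeled digraphs, then take the minimum pairwise edit distance:

```python
def eulerian_strings(edges):
    """All strings spelled by Eulerian trails (using every edge exactly once)
    of a small edge-labeled digraph given as (from, to, label) triples."""
    results = set()
    n = len(edges)

    def extend(node, used, word):
        if len(word) == n:
            results.add("".join(word))
            return
        for i, (u, v, lab) in enumerate(edges):
            if i not in used and u == node:
                extend(v, used | {i}, word + [lab])

    for start in {u for u, _, _ in edges}:
        extend(start, frozenset(), [])
    return results

def edit_distance(a, b):
    """Classic Levenshtein distance with a rolling one-row DP table."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def gted_bruteforce(edges1, edges2):
    """Minimum edit distance over all pairs of Eulerian-trail strings."""
    s1, s2 = eulerian_strings(edges1), eulerian_strings(edges2)
    return min(edit_distance(x, y) for x in s1 for y in s2)

# Two tiny path graphs whose single Eulerian trails spell "ACG" and "ACT":
g1 = [(0, 1, "A"), (1, 2, "C"), (2, 3, "G")]
g2 = [(0, 1, "A"), (1, 2, "C"), (2, 3, "T")]
print(gted_bruteforce(g1, g2))   # → 1
```

The exponential blow-up in the number of Eulerian trails is exactly why the paper's ILP formulations, and eventual heuristics, are needed for realistic de Bruijn graphs.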
Citations: 0
Fast, parallel, and cache-friendly suffix array construction
IF 1 | CAS Zone 4 (Biology) | Q4 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-04-28 | DOI: 10.1186/s13015-024-00263-5
Jamshed Khan, Tobias Rubel, Erin Molloy, Laxman Dhulipala, Rob Patro
String indexes such as the suffix array (sa) and the closely related longest common prefix (lcp) array are fundamental objects in bioinformatics and have a wide variety of applications. Despite their importance in practice, few scalable parallel algorithms for constructing these are known, and the existing algorithms can be highly non-trivial to implement and parallelize. In this paper we present caps-sa, a simple and scalable parallel algorithm for constructing these string indexes, inspired by samplesort and utilizing an LCP-informed mergesort. Due to its design, caps-sa has excellent memory locality and thus incurs fewer cache misses and achieves strong performance on modern multicore systems with deep cache hierarchies. We show that despite its simple design, caps-sa outperforms existing state-of-the-art parallel sa and lcp-array construction algorithms on modern hardware. Finally, motivated by applications in modern aligners where the query strings have bounded lengths, we introduce the notion of a bounded-context sa and show that caps-sa can easily be extended to exploit this structure to obtain further speedups. We make our code publicly available at https://github.com/jamshed/CaPS-SA .
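For readers unfamiliar with the two structures, a minimal sequential construction can be sketched as follows: a naive O(n² log n) suffix array by sorting suffixes, and the lcp array via Kasai's linear-time algorithm. This is only a baseline for understanding the objects; caps-sa's actual construction is parallel and far more efficient:

```python
def suffix_array(s):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    """Kasai's algorithm: lcp[r] = length of the longest common prefix of
    the suffixes at ranks r and r-1 in the suffix array."""
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp, h = [0] * n, 0
    for i in range(n):            # process suffixes in text order
        if rank[i] > 0:
            j = sa[rank[i] - 1]   # suffix preceding suffix i in sorted order
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            h = max(h - 1, 0)     # the next lcp can drop by at most one
        else:
            h = 0
    return lcp

s = "banana"
sa = suffix_array(s)
print(sa)                # → [5, 3, 1, 0, 4, 2]
print(lcp_array(s, sa))  # → [0, 1, 3, 0, 0, 2]
```

The mergesort-of-suffixes view used by caps-sa exploits the same observation Kasai's algorithm does: adjacent suffixes share long prefixes, so comparisons can skip characters already known to match.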
Citations: 0
Pfp-fm: an accelerated FM-index
IF 1 | CAS Zone 4 (Biology) | Q4 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-04-10 | DOI: 10.1186/s13015-024-00260-8
Aaron Hong, Marco Oliva, Dominik Köppl, Hideo Bannai, Christina Boucher, Travis Gagie
FM-indexes are crucial data structures in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [1] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. In 2022, Deng et al. [2] proposed parsing genomic data by induced suffix sorting, and showed that the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing—which takes parameters that let us tune the average length of the phrases—instead of induced suffix sorting gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate it is between 3 and 18 times faster than competing methods on queries to GRCh38, and is consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, our method appears to accelerate counting relative to all state-of-the-art methods, with a moderate increase in memory. The source code for $$\texttt{PFP-FM}$$ is available at https://github.com/AaronHong1024/afm .
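The counting queries discussed above rest on FM-index backward search. A minimal character-based FM-index — the baseline that word-based indexes like PFP-FM accelerate, not PFP-FM itself — can be sketched in Python; the plain-Counter occurrence table is deliberately naive for clarity:

```python
from collections import Counter

def bwt(s):
    """Burrows-Wheeler transform via sorted rotations ($ is the sentinel)."""
    s += "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def fm_index(text):
    """Build what backward search needs: the BWT string L, the C-array
    (number of symbols strictly smaller than c), and prefix rank counts."""
    L = bwt(text)
    C, total = {}, 0
    for c in sorted(Counter(L)):
        C[c] = total
        total += L.count(c)
    # occ[i][c] = number of occurrences of c in L[:i] (naive, O(n^2) space)
    occ = [Counter()]
    for ch in L:
        nxt = occ[-1].copy()
        nxt[ch] += 1
        occ.append(nxt)
    return L, C, occ

def count(pattern, index):
    """Occurrences of pattern in the indexed text, one backward step per char."""
    L, C, occ = index
    lo, hi = 0, len(L)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[lo][c]
        hi = C[c] + occ[hi][c]
        if lo >= hi:
            return 0
    return hi - lo

idx = fm_index("GATTACAGATTACA")
print(count("ATTA", idx))   # → 2
```

Note that the loop performs one rank lookup (a random access in practice) per pattern character — precisely the cost that word-based parsing reduces by consuming whole phrases per step.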
Citations: 0
Space-efficient computation of k-mer dictionaries for large values of k
IF 1 | CAS Zone 4 (Biology) | Q4 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-04-05 | DOI: 10.1186/s13015-024-00259-1
Diego Díaz-Domínguez, Miika Leinonen, Leena Salmela
Computing k-mer frequencies in a collection of reads is a common procedure in many genomic applications. Several state-of-the-art k-mer counters rely on hash tables to carry out this task, but they are often optimised for small k, as a hash table keeping keys explicitly (i.e., k-mer sequences) takes $$O(N\frac{k}{w})$$ computer words, N being the number of distinct k-mers and w the computer word size, which is impractical for long values of k. This space usage is an important limitation as analysis of long and accurate HiFi sequencing reads can require larger values of k. We propose Kaarme, a space-efficient hash table for k-mers using $$O(N+u\frac{k}{w})$$ words of space, where u is the number of reads. Our framework exploits the fact that consecutive k-mers overlap by $$k-1$$ symbols. Thus, we only store the last symbol of a k-mer and a pointer within the hash table to a previous one, which we can use to recover the remaining $$k-1$$ symbols. We adapt Kaarme to compute canonical k-mers as well. This variant also uses pointers within the hash table to save space but requires more work to decode the k-mers. Specifically, it takes $$O(\sigma^{k})$$ time in the worst case, $$\sigma$$ being the size of the DNA alphabet, but our experiments show this is hardly ever the case. The canonical variant does not improve our theoretical results but greatly reduces space usage in practice while keeping a competitive performance for retrieving the k-mers and their frequencies. We compare canonical Kaarme to a regular hash table storing canonical k-mers explicitly as keys and show that our method uses up to five times less space while being less than 1.5 times slower. We also show that canonical Kaarme uses significantly less memory than state-of-the-art k-mer counters when they do not resort to disk to keep intermediate results.
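A canonical k-mer identifies a k-mer with its reverse complement so that both strands count as one key. The following sketch is the plain explicit-key baseline that Kaarme is compared against (and improves on in space), not Kaarme's pointer-based table:

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    """Reverse complement of a DNA string."""
    return s.translate(COMPLEMENT)[::-1]

def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, revcomp(kmer))

def count_canonical_kmers(reads, k):
    """Count canonical k-mers across a read collection, storing full
    k-mer strings as keys (the O(N k/w)-word baseline)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts

reads = ["ACGTT", "AACGT"]   # reverse complements of each other
print(count_canonical_kmers(reads, 3))
```

Because the two example reads are reverse complements, they contribute identical canonical k-mer sets — exactly the strand-independence the canonical convention provides.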
Citations: 0
Infrared: a declarative tree decomposition-powered framework for bioinformatics.
IF 1.5 | CAS Zone 4 (Biology) | Q4 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-03-16 | DOI: 10.1186/s13015-024-00258-2
Hua-Ting Yao, Bertrand Marchand, Sarah J Berkemer, Yann Ponty, Sebastian Will

Motivation: Many bioinformatics problems can be approached as optimization or controlled sampling tasks, and solved exactly and efficiently using Dynamic Programming (DP). However, such exact methods are typically tailored towards specific settings, complex to develop, and hard to implement and adapt to problem variations.

Methods: We introduce the Infrared framework to overcome such hindrances for a large class of problems. Its underlying paradigm is tailored toward problems that can be declaratively formalized as sparse feature networks, a generalization of constraint networks. Classic Boolean constraints specify a search space consisting of putative solutions whose evaluation is performed through a combination of features. Problems are then solved using generic cluster tree elimination algorithms over a tree decomposition of the feature network. Their overall complexities are linear in the number of variables, and only exponential in the treewidth of the feature network. For sparse feature networks, associated with low to moderate treewidths, these algorithms allow one to find optimal solutions, or generate controlled samples, with practical empirical efficiency.
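As a toy instance of this paradigm (our own sketch, not Infrared's actual API), the following code maximizes a sum of pairwise feature functions over a chain of variables by eliminating variables one at a time. A chain has treewidth 1, so the work is linear in the number of variables and quadratic in the domain size; the RNA base-pairing reward is an illustrative stand-in for a real feature network:

```python
def eliminate_chain(domains, pairwise):
    """Maximize sum_i pairwise[i](x_i, x_{i+1}) over a chain-structured
    network by eliminating variables from the last to the first."""
    # msg[v] = best achievable score of the already-eliminated suffix
    # given that the current frontier variable takes value v
    msg = {v: 0.0 for v in domains[-1]}
    for i in range(len(domains) - 2, -1, -1):
        f = pairwise[i]
        msg = {u: max(f(u, v) + msg[v] for v in domains[i + 1])
               for u in domains[i]}
    return max(msg.values())

PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

def pair_score(x, y):
    """Reward adjacent positions that can base-pair (toy feature)."""
    return 1.0 if (x, y) in PAIRS else 0.0

# Three sequence positions over the RNA alphabet, two pairwise features:
best = eliminate_chain([list("ACGU")] * 3, [pair_score, pair_score])
print(best)   # → 2.0
```

Cluster tree elimination generalizes exactly this message-passing step from a chain to an arbitrary tree decomposition, which is why the exponential cost is confined to the treewidth.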

Results: Implementing these methods, the Infrared software allows Python programmers to rapidly develop exact optimization and sampling applications based on efficient tree decomposition-based processing. Instead of directly coding specialized algorithms, problems are declaratively modeled as sets of variables over finite domains, whose dependencies are captured by constraints and functions. Such models are then automatically solved by generic DP algorithms. To illustrate the applicability of Infrared in bioinformatics and guide new users, we model and discuss variants of bioinformatics applications. We provide reimplementations and extensions of methods for RNA design, RNA sequence-structure alignment, parsimony-driven inference of ancestral traits in phylogenetic trees/networks, and design of coding sequences. Moreover, we demonstrate multidimensional Boltzmann sampling. These applications of the framework, together with our novel results, underline the practical relevance of Infrared. Remarkably, the achieved complexities are typically equivalent to those of specialized algorithms and implementations.

Availability: Infrared is available at https://amibio.gitlabpages.inria.fr/Infrared with extensive documentation, including various usage examples and API reference; it can be installed using Conda or from source.

引用次数: 0
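In the simplest case, a chain-shaped feature network (treewidth 1), the cluster tree elimination described in the Infrared abstract reduces to classic max-sum variable elimination. The sketch below is not Infrared's actual API — it is a minimal hand-rolled DP over a chain of variables with pairwise scoring functions, checked against brute-force enumeration; all names are illustrative.

```python
from itertools import product

def eliminate_chain(domains, pair_fns):
    """Max-sum DP over a chain x0 - x1 - ... - x_{n-1}.

    domains[k] is the finite domain of variable x_k;
    pair_fns[k](a, b) scores the partial assignment x_k = a, x_{k+1} = b.
    Returns (best total score, one optimal assignment).
    """
    # best[v] = best score of any prefix assignment ending with x_k = v
    best = {v: 0 for v in domains[0]}
    back = []  # back[k][v] = optimal value of x_k given x_{k+1} = v
    for k, fn in enumerate(pair_fns):
        new_best, choice = {}, {}
        for v in domains[k + 1]:
            score, arg = max((best[u] + fn(u, v), u) for u in domains[k])
            new_best[v], choice[v] = score, arg
        back.append(choice)
        best = new_best
    # traceback of one optimal assignment, right to left
    v = max(best, key=best.get)
    assignment = [v]
    for choice in reversed(back):
        v = choice[v]
        assignment.append(v)
    return max(best.values()), assignment[::-1]

def brute_force(domains, pair_fns):
    """Reference check: enumerate every full assignment."""
    def score(a):
        return sum(fn(a[k], a[k + 1]) for k, fn in enumerate(pair_fns))
    return max(score(a) for a in product(*domains))
```

The elimination runs in time linear in the number of variables and quadratic in the domain size, mirroring the treewidth-dependent bound quoted in the abstract for the special case of a chain.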
Median quartet tree search algorithms using optimal subtree prune and regraft. 使用最优子树修剪和重植的中位四叉树搜索算法。
IF 1 4区 生物学 Q4 BIOCHEMICAL RESEARCH METHODS Pub Date : 2024-03-13 DOI: 10.1186/s13015-024-00257-3
Shayesteh Arasti, Siavash Mirarab

Gene trees can be different from the species tree due to biological processes and inference errors. One way to obtain a species tree is to find one that maximizes some measure of similarity to a set of gene trees. The number of shared quartets between a potential species tree and gene trees provides a statistically justifiable score; if maximized properly, it could result in a statistically consistent estimator of the species tree under several statistical models of discordance. However, finding the median quartet score tree, one that maximizes this score, is NP-Hard, motivating several existing heuristic algorithms. These heuristics do not follow the hill-climbing paradigm used extensively in phylogenetics. In this paper, we make theoretical contributions that enable an efficient hill-climbing approach. Specifically, we show that a subtree of size m can be placed optimally on a tree of size n in quasi-linear time with respect to n and (almost) independently of m. This result enables us to perform subtree prune and regraft (SPR) rearrangements as part of a hill-climbing search. We show that this approach can slightly improve upon the results of widely-used methods such as ASTRAL in terms of the optimization score but not necessarily accuracy.

由于生物过程和推断错误,基因树可能与物种树不同。获得物种树的一种方法是找到一棵与一组基因树相似度最大的树。潜在的物种树与基因树之间共享四分位点的数量提供了一个统计上合理的分数;如果正确地最大化,在几种不一致的统计模型下,它可以产生一个统计上一致的物种树估计值。然而,寻找中位四分树,即最大化该分数的树,是一个 NP-困难的问题,这也是现有几种启发式算法的动机。这些启发式算法并不遵循系统发生学中广泛使用的爬山模式。在本文中,我们的理论贡献使得高效的爬坡方法成为可能。具体来说,我们证明了大小为 m 的子树可以在与 n 有关的准线性时间内以最佳方式放置在大小为 n 的树上,并且(几乎)与 m 无关。这一结果使我们能够在爬山搜索中执行子树修剪和重植(SPR)重新排列。我们的研究表明,与 ASTRAL 等广泛使用的方法相比,这种方法能在优化得分方面略有提高,但不一定能提高准确性。
引用次数: 0
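The quartet score that the median quartet tree search maximizes can be made concrete with a small brute-force sketch. This is emphatically not the paper's quasi-linear placement algorithm — just an enumeration over all four-leaf subsets, using nested tuples as a toy tree encoding; all function names are illustrative.

```python
from itertools import combinations

def nontrivial_splits(tree):
    """Return (leaf set, nontrivial bipartitions) of a tree given as
    nested tuples with string leaf labels, e.g. (("a","b"), ("c","d"))."""
    clades = []
    def walk(t):
        if isinstance(t, str):
            return frozenset([t])
        s = frozenset().union(*(walk(c) for c in t))
        clades.append(s)
        return s
    leaves = walk(tree)
    return leaves, [(s, leaves - s) for s in clades
                    if 1 < len(s) < len(leaves) - 1]

def quartet_topology(quartet, splits):
    """Topology induced on four leaves, encoded as a frozenset of the two
    sibling pairs; None if the quartet is unresolved."""
    a, b, c, d = quartet
    for (x, y), (u, v) in (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))):
        for A, B in splits:
            if ({x, y} <= A and {u, v} <= B) or ({x, y} <= B and {u, v} <= A):
                return frozenset([frozenset([x, y]), frozenset([u, v])])
    return None

def shared_quartets(t1, t2):
    """Number of four-leaf subsets on which both trees induce the same
    resolved topology -- the objective behind the median quartet score."""
    leaves1, s1 = nontrivial_splits(t1)
    leaves2, s2 = nontrivial_splits(t2)
    assert leaves1 == leaves2, "trees must share a leaf set"
    return sum(1 for q in combinations(sorted(leaves1), 4)
               if quartet_topology(q, s1) is not None
               and quartet_topology(q, s1) == quartet_topology(q, s2))
```

A fully resolved tree on five leaves induces a topology on each of its C(5,4) = 5 quartets, so its shared-quartet score against itself is 5; the paper's contribution is maximizing this count over candidate species trees efficiently, via optimal SPR placements.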
Suffix sorting via matching statistics. 通过匹配统计进行后缀排序
IF 1 4区 生物学 Q4 BIOCHEMICAL RESEARCH METHODS Pub Date : 2024-03-12 DOI: 10.1186/s13015-023-00245-z
Zsuzsanna Lipták, Francesco Masillo, Simon J Puglisi

We introduce a new algorithm for constructing the generalized suffix array of a collection of highly similar strings. As a first step, we construct a compressed representation of the matching statistics of the collection with respect to a reference string. We then use this data structure to distribute suffixes into a partial order, and subsequently to speed up suffix comparisons to complete the generalized suffix array. Our experimental evidence with a prototype implementation (a tool we call sacamats) shows that on string collections with highly similar strings we can construct the suffix array in time competitive with or faster than the fastest available methods. Along the way, we describe a heuristic for fast computation of the matching statistics of two strings, which may be of independent interest.

我们引入了一种新算法,用于构建高度相似字符串集合的广义后缀数组。第一步,我们构建了一个压缩表示,表示该集合相对于参考字符串的匹配统计数据。然后,我们使用这种数据结构将后缀分配到部分顺序中,随后加快后缀比较,以完成广义后缀数组。我们使用原型实现(我们称之为 sacamats 的工具)进行的实验证明,在具有高度相似字符串的字符串集合上,我们构建后缀数组的时间可以与现有的最快方法相媲美,甚至更快。同时,我们还介绍了一种快速计算两个字符串匹配统计量的启发式方法,这可能也是我们感兴趣的地方。
引用次数: 0
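Matching statistics, the structure at the heart of the method above, are easy to state: for every suffix of a string s, the length of its longest prefix that occurs somewhere in a reference. The quadratic-time sketch below only pins down that definition — it bears no relation to sacamats' compressed representation, and the names are illustrative.

```python
def matching_statistics(s, ref):
    """ms[i] = length of the longest prefix of s[i:] occurring in ref."""
    ms = []
    for i in range(len(s)):
        length = 0
        # grow the match one character at a time while it still occurs in ref
        while i + length < len(s) and s[i:i + length + 1] in ref:
            length += 1
        ms.append(length)
    return ms
```

For example, `matching_statistics("banana", "ban")` yields `[3, 2, 1, 2, 1, 1]`: the suffix starting at position 0 matches "ban" but not "bana", the suffix at position 1 matches "an", and so on. The paper's insight is that when the collection's strings are highly similar to the reference, these values are highly compressible and can drive a partial order on suffixes.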
Finding maximal exact matches in graphs 在图中寻找最大精确匹配
IF 1 4区 生物学 Q4 BIOCHEMICAL RESEARCH METHODS Pub Date : 2024-03-11 DOI: 10.1186/s13015-024-00255-5
Nicola Rizzo, Manuel Cáceres, Veli Mäkinen
We study the problem of finding maximal exact matches (MEMs) between a query string Q and a labeled graph G. MEMs are an important class of seeds, often used in seed-chain-extend type of practical alignment methods because of their strong connections to classical metrics. A principled way to speed up chaining is to limit the number of MEMs by considering only MEMs of length at least $$\kappa$$ ($$\kappa$$-MEMs). However, on arbitrary input graphs, the problem of finding MEMs cannot be solved in truly sub-quadratic time under SETH (Equi et al., TALG 2023), even on acyclic graphs. In this paper we show an $$O(n \cdot L \cdot d^{L-1} + m + M_{\kappa,L})$$-time algorithm finding all $$\kappa$$-MEMs between Q and G spanning exactly L nodes in G, where n is the total length of node labels, d is the maximum degree of a node in G, $$m = |Q|$$, and $$M_{\kappa,L}$$ is the number of output MEMs. We use this algorithm to develop a $$\kappa$$-MEM finding solution on indexable Elastic Founder Graphs (Equi et al., Algorithmica 2022) running in time $$O(nH^2 + m + M_\kappa)$$, where H is the maximum number of nodes in a block, and $$M_\kappa$$ is the total number of $$\kappa$$-MEMs. Our results generalize to the analysis of multiple query strings (MEMs between G and any of the strings). Additionally, we provide some experimental results showing that the number of graph MEMs is an order of magnitude smaller than the number of string MEMs of the corresponding concatenated collection. We show that seed-chain-extend type of alignment methods can be implemented on top of indexable Elastic Founder Graphs by providing an efficient way to produce the seeds between a set of queries and the graph. The code is available at https://github.com/algbio/efg-mems.
我们研究的问题是在查询字符串 Q 和标记图 G 之间寻找最大精确匹配(MEMs)。MEMs 是一类重要的种子,由于其与经典度量标准的紧密联系,经常被用于种子链扩展类型的实用配准方法。加快链式排列的一个原则性方法是限制 MEM 的数量,只考虑长度至少为 $$\kappa$$ 的 MEM($$\kappa$$-MEM)。然而,在任意输入图上,即使是无环图,也无法在 SETH(Equi 等人,TALG 2023)下以真正的亚二次方时间解决寻找 MEMs 的问题。在本文中,我们展示了一种 $$O(n \cdot L \cdot d^{L-1} + m + M_{\kappa,L})$$ 时间的算法,可以找到 Q 和 G 之间正好跨越 G 中 L 个节点的所有 $$\kappa$$-MEM,其中 n 是节点标签的总长度,d 是 G 中节点的最大度数,$$m = |Q|$$,$$M_{\kappa,L}$$ 是输出 MEM 的数量。我们使用该算法在可索引的弹性方正图(Equi et al., Algorithmica 2022)上开发了一个 $$\kappa$$-MEM 查找解决方案,运行时间为 $$O(nH^2 + m + M_\kappa)$$,其中 H 是块中节点的最大数量,$$M_\kappa$$ 是 $$\kappa$$-MEM 的总数。我们的结果可以推广到多个查询字符串(G 与任意字符串之间的 MEM)的分析。此外,我们还提供了一些实验结果,表明图 MEMs 的数量比相应串联集合的字符串 MEMs 数量要少一个数量级。我们展示了种子链扩展类型的对齐方法,通过提供在一组查询和图之间生成种子的有效方法,可以在可索引的弹性方正图之上实现。代码见 https://github.com/algbio/efg-mems。
引用次数: 0
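On plain strings (a graph that is a single non-branching path), maximal exact matches have a direct brute-force definition, which helps make the output set M_κ concrete. The sketch below is quadratic and unrelated to the paper's indexed graph algorithm; the function name and parameters are illustrative.

```python
def kappa_mems(q, t, kappa=1):
    """All MEMs (i, j, length) with q[i:i+length] == t[j:j+length],
    extendable neither to the left nor to the right, of length >= kappa."""
    out = []
    for i in range(len(q)):
        for j in range(len(t)):
            if q[i] != t[j]:
                continue
            if i > 0 and j > 0 and q[i - 1] == t[j - 1]:
                continue  # not left-maximal: contained in a longer match
            length = 1
            while (i + length < len(q) and j + length < len(t)
                   and q[i + length] == t[j + length]):
                length += 1
            # right-maximality holds by construction: the loop stopped at a
            # mismatch or at the end of one of the strings
            if length >= kappa:
                out.append((i, j, length))
    return out
```

Raising the threshold kappa prunes short seeds, which is exactly the lever the paper uses to keep chaining fast: for `q = "aba"` and `t = "ba"`, `kappa_mems(q, t)` reports both `(0, 1, 1)` and `(1, 0, 2)`, while `kappa_mems(q, t, kappa=2)` keeps only the length-2 match.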