{"title":"Finding maximal exact matches in graphs","authors":"Nicola Rizzo, Manuel Cáceres, Veli Mäkinen","doi":"10.1186/s13015-024-00255-5","DOIUrl":null,"url":null,"abstract":"We study the problem of finding maximal exact matches (MEMs) between a query string Q and a labeled graph G. MEMs are an important class of seeds, often used in seed-chain-extend type of practical alignment methods because of their strong connections to classical metrics. A principled way to speed up chaining is to limit the number of MEMs by considering only MEMs of length at least $$\\kappa$$ ( $$\\kappa$$ -MEMs). However, on arbitrary input graphs, the problem of finding MEMs cannot be solved in truly sub-quadratic time under SETH (Equi et al., TALG 2023) even on acyclic graphs. In this paper we show an $$O(n\\cdot L \\cdot d^{L-1} + m + M_{\\kappa ,L})$$ -time algorithm finding all $$\\kappa$$ -MEMs between Q and G spanning exactly L nodes in G, where n is the total length of node labels, d is the maximum degree of a node in G, $$m = |Q|$$ , and $$M_{\\kappa ,L}$$ is the number of output MEMs. We use this algorithm to develop a $$\\kappa$$ -MEM finding solution on indexable Elastic Founder Graphs (Equi et al., Algorithmica 2022) running in time $$O(nH^2 + m + M_\\kappa )$$ , where H is the maximum number of nodes in a block, and $$M_\\kappa$$ is the total number of $$\\kappa$$ -MEMs. Our results generalize to the analysis of multiple query strings (MEMs between G and any of the strings). Additionally, we provide some experimental results showing that the number of graph MEMs is an order of magnitude smaller than the number of string MEMs of the corresponding concatenated collection. We show that seed-chain-extend type of alignment methods can be implemented on top of indexable Elastic Founder Graphs by providing an efficient way to produce the seeds between a set of queries and the graph. The code is available in https://github.com/algbio/efg-mems .","PeriodicalId":1,"journal":{"name":"Accounts of Chemical Research","volume":null,"pages":null},"PeriodicalIF":16.4000,"publicationDate":"2024-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Accounts of Chemical Research","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s13015-024-00255-5","RegionNum":1,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CHEMISTRY, MULTIDISCIPLINARY","Score":null,"Total":0}
引用次数: 0
Abstract
We study the problem of finding maximal exact matches (MEMs) between a query string Q and a labeled graph G. MEMs are an important class of seeds, often used in seed-chain-extend type of practical alignment methods because of their strong connections to classical metrics. A principled way to speed up chaining is to limit the number of MEMs by considering only MEMs of length at least $$\kappa$$ ( $$\kappa$$ -MEMs). However, on arbitrary input graphs, the problem of finding MEMs cannot be solved in truly sub-quadratic time under SETH (Equi et al., TALG 2023) even on acyclic graphs. In this paper we show an $$O(n\cdot L \cdot d^{L-1} + m + M_{\kappa ,L})$$ -time algorithm finding all $$\kappa$$ -MEMs between Q and G spanning exactly L nodes in G, where n is the total length of node labels, d is the maximum degree of a node in G, $$m = |Q|$$ , and $$M_{\kappa ,L}$$ is the number of output MEMs. We use this algorithm to develop a $$\kappa$$ -MEM finding solution on indexable Elastic Founder Graphs (Equi et al., Algorithmica 2022) running in time $$O(nH^2 + m + M_\kappa )$$ , where H is the maximum number of nodes in a block, and $$M_\kappa$$ is the total number of $$\kappa$$ -MEMs. Our results generalize to the analysis of multiple query strings (MEMs between G and any of the strings). Additionally, we provide some experimental results showing that the number of graph MEMs is an order of magnitude smaller than the number of string MEMs of the corresponding concatenated collection. We show that seed-chain-extend type of alignment methods can be implemented on top of indexable Elastic Founder Graphs by providing an efficient way to produce the seeds between a set of queries and the graph. The code is available in https://github.com/algbio/efg-mems .
期刊介绍:
Accounts of Chemical Research presents short, concise and critical articles offering easy-to-read overviews of basic research and applications in all areas of chemistry and biochemistry. These short reviews focus on research from the author’s own laboratory and are designed to teach the reader about a research project. In addition, Accounts of Chemical Research publishes commentaries that give an informed opinion on a current research problem. Special Issues online are devoted to a single topic of unusual activity and significance.
Accounts of Chemical Research replaces the traditional article abstract with an article "Conspectus." These entries synopsize the research affording the reader a closer look at the content and significance of an article. Through this provision of a more detailed description of the article contents, the Conspectus enhances the article's discoverability by search engines and the exposure for the research.