Pub Date: 2023-02-06 | DOI: 10.4230/LIPIcs.CPM.2023.9
Massimo Equi, A. V. D. Griend, V. Mäkinen
Many problems that can be solved in quadratic time have bit-parallel speed-ups with factor $w$, where $w$ is the computer word size. A classic example is computing the edit distance of two strings of length $n$, which can be solved in $O(n^2/w)$ time. In a reasonable classical model of computation, one can assume $w=\Theta(\log n)$, and obtaining significantly better speed-ups is unlikely in the light of conditional lower bounds obtained for such problems. In this paper, we study the connection of bit-parallelism to quantum computation, aiming to see if a bit-parallel algorithm could be converted to a quantum algorithm with better than logarithmic speed-up. We focus on string matching in labeled graphs, the problem of finding an exact occurrence of a string as the label of a path in a graph. This problem admits a quadratic conditional lower bound even under a very restricted class of graphs (Equi et al., ICALP 2019), stating that no algorithm in the classical model of computation can solve the problem in time $O(|P||E|^{1-\epsilon})$ or $O(|P|^{1-\epsilon}|E|)$. We show that a simple bit-parallel algorithm on such a restricted family of graphs (level DAGs) can indeed be converted into a realistic quantum algorithm that attains subquadratic time complexity $O(|E|\sqrt{|P|})$.
Title: "From Bit-Parallelism to Quantum String Matching for Labelled Graphs" (Annual Symposium on Combinatorial Pattern Matching)
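As an illustration of the bit-parallelism the abstract builds on (not the paper's quantum algorithm), here is a minimal Shift-And matcher: it packs the states of the pattern's matching automaton into a machine word, so each text character is processed with a constant number of word operations.

```python
def shift_and(text, pattern):
    """Bit-parallel Shift-And exact string matching.
    Bit i of `state` is set iff pattern[0..i] matches a suffix of the
    text read so far; each character costs O(1) word operations."""
    m = len(pattern)
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    state, hits = 0, []
    for j, c in enumerate(text):
        # extend every partial match by one character, or start a new one
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & (1 << (m - 1)):       # full match ends at position j
            hits.append(j - m + 1)
    return hits
```

For patterns longer than the word size the state spills over several words, which is where the $O(n^2/w)$-style speed-ups mentioned above come from.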
Pub Date: 2023-02-06 | DOI: 10.48550/arXiv.2302.02586
H. Bannai, Mitsuru Funakoshi, Kazuhiro Kurita, Yuto Nakashima, Kazuhisa Seto, T. Uno
LZ-End is a variant of the well-known Lempel-Ziv parsing family such that each phrase of the parsing has a previous occurrence, with the additional constraint that the previous occurrence must end at the end of a previous phrase. LZ-End was initially proposed as a greedy parsing, where each phrase is determined greedily from left to right, as the longest factor that satisfies the above constraint [Kreft & Navarro, 2010]. In this work, we consider an optimal LZ-End parsing that has the minimum number of phrases among all such parsings. We show that a decision version of computing the optimal LZ-End parsing is NP-complete by showing a reduction from the vertex cover problem. Moreover, we give a MAX-SAT formulation for the optimal LZ-End parsing, adapting an approach for computing various NP-hard repetitiveness measures recently presented by [Bannai et al., 2022]. We also consider the approximation ratio of the size of the greedy LZ-End parsing to the size of the optimal LZ-End parsing, and give a lower bound on the ratio which asymptotically approaches $2$.
Title: "Optimal LZ-End Parsing is Hard" (Annual Symposium on Combinatorial Pattern Matching)
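A brute-force sketch may help fix the greedy definition above: each phrase copies a longest earlier string that ends exactly at a previous phrase boundary, extended by one explicit character. The function name and the cubic-time strategy are ours, for illustration only.

```python
def lz_end_greedy(text):
    """Naive greedy LZ-End parsing: every phrase is a longest copy
    ending at a previous phrase boundary, plus one explicit character.
    Cubic time -- illustrative, not an efficient implementation."""
    n = len(text)
    ends = []                # end positions (exclusive) of phrases so far
    phrases = []
    i = 0
    while i < n:
        best = 0
        for e in ends:
            # longest copy text[i:i+ell] equal to a string ending at e
            for ell in range(min(e, n - 1 - i), 0, -1):
                if text[e - ell:e] == text[i:i + ell]:
                    best = max(best, ell)
                    break
        phrases.append(text[i:i + best + 1])   # copy + explicit character
        i += best + 1
        ends.append(i)
    return phrases
```

The NP-hardness result above concerns minimizing the *number* of phrases, which this greedy left-to-right rule does not always achieve.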
Pub Date: 2023-02-01 | DOI: 10.48550/arXiv.2302.00724
Paweł Gawrychowski, Samah Ghazawi, G. M. Landau
An order-preserving square in a string is a fragment of the form $uv$ where $u \neq v$ and $u$ is order-isomorphic to $v$. We show that a string $w$ of length $n$ over an alphabet of size $\sigma$ contains $\mathcal{O}(\sigma n)$ order-preserving squares that are distinct as words. This improves the upper bound of $\mathcal{O}(\sigma^{2}n)$ by Kociumaka, Radoszewski, Rytter, and Waleń [TCS 2016]. Further, for every $\sigma$ and $n$ we exhibit a string with $\Omega(\sigma n)$ order-preserving squares that are distinct as words, thus establishing that our upper bound is asymptotically tight. Finally, we design an $\mathcal{O}(\sigma n)$ time algorithm that outputs all order-preserving squares that occur in a given string and are distinct as words. By our lower bound, this is optimal in the worst case.
Title: "Order-Preserving Squares in Strings" (Annual Symposium on Combinatorial Pattern Matching)
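To make the definition concrete, here is a quadratic order-isomorphism check and a brute-force enumeration of order-preserving squares distinct as words; the paper's $\mathcal{O}(\sigma n)$ algorithm is far more sophisticated, this is only a reference oracle (names are ours).

```python
def order_isomorphic(u, v):
    """u and v are order-isomorphic iff relative order (and equality)
    of characters agrees at every pair of positions."""
    if len(u) != len(v):
        return False
    for i in range(len(u)):
        for j in range(len(u)):
            if (u[i] < u[j]) != (v[i] < v[j]):
                return False
            if (u[i] == u[j]) != (v[i] == v[j]):
                return False
    return True

def op_squares(w):
    """All order-preserving squares uv (u != v, u ~ v) occurring in w,
    collected as distinct words. Brute force, for illustration."""
    result = set()
    n = len(w)
    for i in range(n):
        for half in range(1, (n - i) // 2 + 1):
            u, v = w[i:i + half], w[i + half:i + 2 * half]
            if u != v and order_isomorphic(u, v):
                result.add(u + v)
    return result
```
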
Pub Date: 2023-01-23 | DOI: 10.48550/arXiv.2301.09477
P. Bille, J. Fischer, I. L. Gørtz, Max Rishøj Pedersen, Tord Stordalen
Given a string $S$ over an alphabet $\Sigma$, the 'string indexing problem' is to preprocess $S$ to subsequently support efficient pattern matching queries, i.e., given a pattern string $P$, report all the occurrences of $P$ in $S$. In this paper we study the 'streaming sliding window string indexing problem'. Here the string $S$ arrives as a stream, one character at a time, and the goal is to maintain an index of the last $w$ characters, called the 'window', for a specified parameter $w$. At any point in time a pattern matching query for a pattern $P$ may arrive, also streamed one character at a time, and all occurrences of $P$ within the current window must be returned. The streaming sliding window string indexing problem naturally captures scenarios where we want to index the most recent data (i.e., the window) of a stream while supporting efficient pattern matching. Our main result is a simple $O(w)$ space data structure that uses $O(\log w)$ time with high probability to process each character from both the input string $S$ and the pattern string $P$. Reporting each occurrence from $P$ uses additional constant time per reported occurrence. Compared to previous work in similar scenarios, this result is the first to achieve an efficient worst-case time per character from the input stream. We also consider a delayed variant of the problem, where a query may be answered at any point within the next $\delta$ characters that arrive from either stream. We present an $O(w + \delta)$ space data structure for this problem that improves the above time bounds to $O(\log(w/\delta))$. In particular, for a delay of $\delta = \epsilon w$ we obtain an $O(w)$ space data structure with constant time processing per character. The key idea to achieve our result is a novel and simple hierarchical structure of suffix trees of independent interest, inspired by the classic log-structured merge trees.
Title: "Sliding Window String Indexing in Streams" (Annual Symposium on Combinatorial Pattern Matching)
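To pin down the problem setting, a naive baseline (deliberately not the paper's $O(\log w)$-per-character structure) keeps the last $w$ characters in a bounded deque and scans the window on each query:

```python
from collections import deque

class SlidingWindowIndex:
    """Naive sliding-window 'index': retains the last w characters and
    answers occurrence queries by direct scan. Illustrates the problem
    interface only; per-query cost is O(w * |P|)."""

    def __init__(self, w):
        self.window = deque(maxlen=w)   # old characters fall off the left

    def push(self, c):
        self.window.append(c)

    def occurrences(self, pattern):
        s = "".join(self.window)
        return [i for i in range(len(s) - len(pattern) + 1)
                if s[i:i + len(pattern)] == pattern]
```
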
Pub Date: 2022-12-15 | DOI: 10.48550/arXiv.2212.07870
Manuel Cáceres
The problem of String Matching to Labeled Graphs (SMLG) asks to find all the paths in a labeled graph $G = (V, E)$ whose spellings match that of an input string $S \in \Sigma^m$. SMLG can be solved in quadratic $O(m|E|)$ time [Amir et al., JALG], which was proven to be optimal by a recent lower bound conditioned on SETH [Equi et al., ICALP 2019]. The lower bound states that no strongly subquadratic time algorithm exists, even if restricted to directed acyclic graphs (DAGs). In this work we present the first parameterized algorithms for SMLG in DAGs. Our parameters capture the topological structure of $G$. All our results are derived from a generalization of the Knuth-Morris-Pratt algorithm [Park and Kim, CPM 1995] optimized to work in time proportional to the number of prefix-incomparable matches. To obtain the parameterization in the topological structure of $G$, we first study a special class of DAGs called funnels [Millani et al., JCO] and generalize them to $k$-funnels and the class $ST_k$. We present several novel characterizations and algorithmic contributions on both funnels and their generalizations.
Title: "Parameterized Algorithms for String Matching to DAGs: Funnels and Beyond" (Annual Symposium on Combinatorial Pattern Matching)
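The quadratic $O(m|E|)$ baseline that the paper parameterizes can be sketched as a dynamic program over a node-labeled DAG processed in topological order (the representation and names below are ours, not the paper's):

```python
def smlg_dag(nodes, edges, label, S):
    """Naive O(m|E|) dynamic program for SMLG on a node-labeled DAG.
    D[v][j] is True iff some path ending at node v spells S[0..j].
    `nodes` must be listed in topological order."""
    m = len(S)
    preds = {v: [] for v in nodes}
    for u, v in edges:
        preds[v].append(u)
    D = {v: [False] * m for v in nodes}
    for v in nodes:                       # topological order
        for j in range(m):
            if label[v] != S[j]:
                continue                  # node label must match S[j]
            D[v][j] = (j == 0) or any(D[u][j - 1] for u in preds[v])
    # nodes at which a full occurrence of S ends
    return [v for v in nodes if D[v][m - 1]]
```
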
Pub Date: 2022-08-19 | DOI: 10.48550/arXiv.2208.09351
E. Myers
Merging $T$ sorted, non-redundant lists containing $M$ elements into a single sorted, non-redundant result of size $N \ge M/T$ is a classic problem, typically solved in practice in $O(M \log T)$ time with a priority-queue data structure, the most basic of which is the simple *heap*. We revisit this problem in the situation where the list elements are *strings* and the lists contain many *identical or nearly identical elements*. By keeping simple auxiliary information with each heap node, we devise an $O(M \log T + S)$ worst-case method that performs no more character comparisons than the sum $S$ of the lengths of all the strings, and another $O(M \log(T/\bar{e}) + S)$ method that becomes progressively more efficient as a function of the fraction of equal elements $\bar{e} = M/N$ between input lists, reaching linear time when the lists are all identical. The methods perform favorably in practice versus an alternate formulation based on a trie.
Title: "Merging Sorted Lists of Similar Strings" (Annual Symposium on Combinatorial Pattern Matching)
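The classic $O(M \log T)$ heap-based baseline that the paper refines can be sketched with the standard library's heap merge, deduplicating on the fly; the paper's contribution is avoiding the redundant character comparisons this version performs on similar strings.

```python
import heapq

def merge_unique(lists):
    """Merge T sorted string lists into one sorted, duplicate-free
    result using an O(M log T) heap merge (baseline, not the paper's
    character-comparison-saving variant)."""
    out = []
    for s in heapq.merge(*lists):   # pops globally smallest string each step
        if not out or out[-1] != s:
            out.append(s)           # skip elements equal to the last emitted
    return out
```
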
Pub Date: 2022-06-03 | DOI: 10.48550/arXiv.2206.01688
G. Navarro, Cristian Urbina
An L-system (for lossless compression) is a CPD0L-system extended with two parameters $d$ and $n$, which determines unambiguously a string $w = \tau(\varphi^d(s))[1:n]$, where $\varphi$ is the morphism of the system, $s$ is its axiom, and $\tau$ is its coding. The length of the shortest description of an L-system generating $w$ is known as $\ell$, and is arguably a relevant measure of repetitiveness that builds on the self-similarities that arise in the sequence. In this paper we deepen the study of the measure $\ell$ and its relation with $\delta$, a better established lower bound that builds on substring complexity. Our results show that $\ell$ and $\delta$ are largely orthogonal, in the sense that one can be much larger than the other, depending on the case. This suggests that both sources of repetitiveness are mostly unrelated. We also show that the recently introduced NU-systems, which combine the capabilities of L-systems with bidirectional macro-schemes, can be asymptotically strictly smaller than both mechanisms, which makes the size $\nu$ of the smallest NU-system the unique smallest reachable repetitiveness measure to date.
Title: "L-systems for Measuring Repetitiveness" (Annual Symposium on Combinatorial Pattern Matching)
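A minimal sketch of how such a system determines its string $w = \tau(\varphi^d(s))[1:n]$: apply the morphism $d$ times to the axiom, apply the coding, and keep the first $n$ symbols. The example rules in the test (a Thue-Morse-like morphism) are a hypothetical system of ours, not one from the paper.

```python
def l_system_string(axiom, morphism, coding, d, n):
    """Evaluate w = coding(morphism^d(axiom))[:n] for a CPD0L-style
    system given as dicts mapping each symbol to its image."""
    s = axiom
    for _ in range(d):
        s = "".join(morphism[c] for c in s)   # one application of phi
    return "".join(coding[c] for c in s)[:n]  # apply tau, truncate to n
```
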
Pub Date: 2022-04-12 | DOI: 10.48550/arXiv.2204.05969
Diego Díaz-Domínguez, G. Navarro
We present a new semi-external algorithm that builds the Burrows--Wheeler transform variant of Bauer et al. (a.k.a., BCR BWT) in linear expected time. Our method uses compression techniques to reduce computational costs when the input is massive and repetitive. Concretely, we build on induced suffix sorting (ISS) and resort to run-length and grammar compression to maintain our intermediate results in compact form. Our compression format not only saves space but also speeds up the required computations. Our experiments show important space and computation time savings when the text is repetitive. In moderate-size collections of real human genome assemblies (14.2 GB to 75.05 GB), our memory peak is, on average, 1.7x smaller than the peak of the state-of-the-art BCR BWT construction algorithm (\texttt{ropebwt2}), while running 5x faster. Our current implementation was also able to compute the BCR BWT of 400 real human genome assemblies (1.2 TB) in 41.21 hours using 118.83 GB of working memory (around 10% of the input size). Interestingly, the results we report in the 1.2 TB file are dominated by the difficulties of scanning huge files under memory constraints (specifically, I/O operations). This fact indicates we can perform much better with a more careful implementation of our method, thus scaling to even bigger sizes efficiently.
Title: "Efficient Construction of the BWT for Repetitive Text Using String Compression" (Annual Symposium on Combinatorial Pattern Matching)
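For reference, the plain BWT that the semi-external BCR construction above scales to terabyte inputs can be computed on small inputs by the textbook rotation-sorting definition (quadratic space, illustration only):

```python
def bwt(text):
    """Textbook Burrows-Wheeler transform via sorted rotations of
    text + '$' (unique end marker). Quadratic -- for small examples."""
    t = text + "$"
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)   # last column of the matrix
```
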
Pub Date: 2022-04-09 | DOI: 10.48550/arXiv.2204.04422
Vincent Jugé
We study the IS-algorithm, a well-known linear-time algorithm for computing the suffix array of a word. This algorithm relies on transforming the input word $w$ into another word, called the reduced word of $w$, that is at most half as long; the algorithm then recursively computes the suffix array of the reduced word. In this article, we study the reduction ratio of the IS-algorithm, i.e., the ratio between the lengths of the input word and the word obtained after reducing the input word $k$ times. We investigate both worst cases, in which we find precise results, and random cases, where we prove some strong convergence phenomena. Finally, we prove that, if the input word is a randomly chosen word of length $n$, we should not expect much more than $\log(\log(n))$ recursive function calls.
Title: "Reduction ratio of the IS-algorithm: worst and random cases" (Annual Symposium on Combinatorial Pattern Matching)
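For contrast with the linear-time IS-algorithm discussed above, the object it computes, the suffix array, can be built naively by sorting suffix start positions (no recursive reduction; $O(n^2 \log n)$ in the worst case):

```python
def suffix_array(w):
    """Naive suffix array: indices of all suffixes of w, sorted
    lexicographically by the suffix they start. Illustration only;
    the IS-algorithm achieves this in linear time."""
    return sorted(range(len(w)), key=lambda i: w[i:])
```
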
Pub Date: 2022-02-26 | DOI: 10.4230/LIPIcs.CPM.2022.25
David Cenzato, Zsuzsanna Lipták
The extended Burrows-Wheeler-Transform (eBWT), introduced by Mantaci et al. [Theor. Comput. Sci., 2007], is a generalization of the Burrows-Wheeler-Transform (BWT) to multisets of strings. While the original BWT is based on the lexicographic order, the eBWT uses the omega-order, which differs from the lexicographic order in important ways. A number of tools are available that compute the BWT of string collections; however, the data structures they generate in most cases differ from the one originally defined, as well as from each other. In this paper, we review the differences between these BWT variants, both from a theoretical and from a practical point of view, comparing them on several real-life datasets with different characteristics. We find that the differences can be extensive, depending on the dataset characteristics, and are largest on collections of many highly similar short sequences. The widely-used parameter $r$, the number of runs of the BWT, also shows notable variation between the different BWT variants; on our datasets, it varied by a multiplicative factor of up to $4.2$.
Title: "A theoretical and experimental analysis of BWT variants for string collections" (Annual Symposium on Combinatorial Pattern Matching)
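The parameter $r$ compared across BWT variants above is simply the number of maximal equal-letter runs in the transformed string; a short helper suffices to measure it (sketch, our naming):

```python
def bwt_runs(bwt_string):
    """Number r of maximal equal-letter runs in a BWT string,
    e.g. 'annb$aa' has runs a|nn|b|$|aa, so r = 5."""
    return sum(1 for i, c in enumerate(bwt_string)
               if i == 0 or c != bwt_string[i - 1])
```
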