
Latest publications from the Annual Symposium on Combinatorial Pattern Matching

From Bit-Parallelism to Quantum String Matching for Labelled Graphs
Pub Date : 2023-02-06 DOI: 10.4230/LIPIcs.CPM.2023.9
Massimo Equi, A. V. D. Griend, V. Mäkinen
Many problems that can be solved in quadratic time have bit-parallel speed-ups with factor $w$, where $w$ is the computer word size. A classic example is computing the edit distance of two strings of length $n$, which can be solved in $O(n^2/w)$ time. In a reasonable classical model of computation, one can assume $w=\Theta(\log n)$, and obtaining significantly better speed-ups is unlikely in the light of conditional lower bounds obtained for such problems. In this paper, we study the connection of bit-parallelism to quantum computation, aiming to see if a bit-parallel algorithm could be converted to a quantum algorithm with better than logarithmic speed-up. We focus on string matching in labeled graphs, the problem of finding an exact occurrence of a string as the label of a path in a graph. This problem admits a quadratic conditional lower bound under a very restricted class of graphs (Equi et al., ICALP 2019), stating that no algorithm in the classical model of computation can solve the problem in time $O(|P||E|^{1-\epsilon})$ or $O(|P|^{1-\epsilon}|E|)$. We show that a simple bit-parallel algorithm on such a restricted family of graphs (level DAGs) can indeed be converted into a realistic quantum algorithm that attains subquadratic time complexity $O(|E|\sqrt{|P|})$.
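The word-level bit-parallelism the abstract builds on can be illustrated with the classic Shift-And exact matcher, a minimal sketch (this is the general technique, not the paper's algorithm): all $m$ pattern states are packed into one machine word and advanced with a constant number of word operations per text character.

```python
def shift_and(text: str, pattern: str):
    """Classic Shift-And: report start positions of exact occurrences.

    Each pattern state is one bit, so all m states advance with O(1)
    word operations per text character (for m up to the word size).
    """
    m = len(pattern)
    # B[c]: bitmask with bit i set iff pattern[i] == c
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << i)
    D = 0  # bit i set iff pattern[0:i+1] is a suffix of the text read so far
    occurrences = []
    for j, c in enumerate(text):
        D = ((D << 1) | 1) & B.get(c, 0)
        if D & (1 << (m - 1)):
            occurrences.append(j - m + 1)
    return occurrences
```

For example, `shift_and("abracadabra", "abra")` reports the occurrences at positions 0 and 7.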
Citations: 1
Optimal LZ-End Parsing is Hard
Pub Date : 2023-02-06 DOI: 10.48550/arXiv.2302.02586
H. Bannai, Mitsuru Funakoshi, Kazuhiro Kurita, Yuto Nakashima, Kazuhisa Seto, T. Uno
LZ-End is a variant of the well-known Lempel-Ziv parsing family such that each phrase of the parsing has a previous occurrence, with the additional constraint that the previous occurrence must end at the end of a previous phrase. LZ-End was initially proposed as a greedy parsing, where each phrase is determined greedily from left to right, as the longest factor that satisfies the above constraint [Kreft & Navarro, 2010]. In this work, we consider an optimal LZ-End parsing that has the minimum number of phrases among all such parsings. We show that a decision version of computing the optimal LZ-End parsing is NP-complete by showing a reduction from the vertex cover problem. Moreover, we give a MAX-SAT formulation for the optimal LZ-End parsing, adapting an approach for computing various NP-hard repetitiveness measures recently presented by [Bannai et al., 2022]. We also consider the approximation ratio of the size of the greedy LZ-End parsing to the size of the optimal LZ-End parsing, and give a lower bound on the ratio which asymptotically approaches $2$.
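The greedy parsing of Kreft & Navarro that the abstract contrasts with can be sketched naively; the triple representation `(e, l, c)` and the cubic-time brute-force search below are simplifications for exposition, not the authors' data structures.

```python
def lz_end_parse(t: str):
    """Greedy LZ-End parsing (naive O(n^3) reference sketch).

    Each phrase copies t[e-l:e] -- a suffix of the prefix ending at a
    previous phrase boundary e -- then appends one explicit character.
    Returns a list of triples (e, l, c).
    """
    ends = []      # prefix lengths at which previous phrases end
    phrases = []
    i, n = 0, len(t)
    while i < n:
        best_e, best_l = 0, 0
        for e in ends:
            # longest l with t[i:i+l] == t[e-l:e], keeping one explicit char
            for l in range(min(e, n - 1 - i), 0, -1):
                if t[i:i + l] == t[e - l:e]:
                    if l > best_l:
                        best_e, best_l = e, l
                    break  # descending l: first hit is longest for this e
        phrases.append((best_e, best_l, t[i + best_l]))
        i += best_l + 1
        ends.append(i)
    return phrases


def lz_end_decode(phrases):
    """Invert the parsing: copies always refer to already-decoded text."""
    out = ""
    for e, l, c in phrases:
        out += out[e - l:e] + c
    return out
```

A round trip `lz_end_decode(lz_end_parse(s)) == s` holds for any input, which is a convenient sanity check for the phrase constraint.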
Citations: 0
Order-Preserving Squares in Strings
Pub Date : 2023-02-01 DOI: 10.48550/arXiv.2302.00724
Paweł Gawrychowski, Samah Ghazawi, G. M. Landau
An order-preserving square in a string is a fragment of the form $uv$ where $u \neq v$ and $u$ is order-isomorphic to $v$. We show that a string $w$ of length $n$ over an alphabet of size $\sigma$ contains $\mathcal{O}(\sigma n)$ order-preserving squares that are distinct as words. This improves the upper bound of $\mathcal{O}(\sigma^{2}n)$ by Kociumaka, Radoszewski, Rytter, and Waleń [TCS 2016]. Further, for every $\sigma$ and $n$ we exhibit a string with $\Omega(\sigma n)$ order-preserving squares that are distinct as words, thus establishing that our upper bound is asymptotically tight. Finally, we design an $\mathcal{O}(\sigma n)$ time algorithm that outputs all order-preserving squares that occur in a given string and are distinct as words. By our lower bound, this is optimal in the worst case.
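The definitions above can be made concrete with a quadratic brute force; this is a sketch of the notions only, far from the paper's $\mathcal{O}(\sigma n)$ algorithm.

```python
from itertools import combinations

def order_isomorphic(u, v):
    """u and v are order-isomorphic iff the relative order (<, =, >)
    of every pair of positions agrees. Quadratic pairwise check."""
    if len(u) != len(v):
        return False
    for i, j in combinations(range(len(u)), 2):
        if (u[i] < u[j]) != (v[i] < v[j]) or (u[i] == u[j]) != (v[i] == v[j]):
            return False
    return True

def op_squares(s):
    """All fragments uv with u != v and u order-isomorphic to v,
    collected as distinct words (tuples)."""
    found = set()
    n = len(s)
    for i in range(n):
        for half in range(1, (n - i) // 2 + 1):
            u = tuple(s[i:i + half])
            v = tuple(s[i + half:i + 2 * half])
            if u != v and order_isomorphic(u, v):
                found.add(u + v)
    return found
```

For instance, `(1, 2, 3, 4)` is an order-preserving square: `u = (1, 2)` and `v = (3, 4)` differ as words but induce the same relative order.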
Citations: 0
Sliding Window String Indexing in Streams
Pub Date : 2023-01-23 DOI: 10.48550/arXiv.2301.09477
P. Bille, J. Fischer, I. L. Gørtz, Max Rishøj Pedersen, Tord Stordalen
Given a string $S$ over an alphabet $\Sigma$, the 'string indexing problem' is to preprocess $S$ to subsequently support efficient pattern matching queries, i.e., given a pattern string $P$ report all the occurrences of $P$ in $S$. In this paper we study the 'streaming sliding window string indexing problem'. Here the string $S$ arrives as a stream, one character at a time, and the goal is to maintain an index of the last $w$ characters, called the 'window', for a specified parameter $w$. At any point in time a pattern matching query for a pattern $P$ may arrive, also streamed one character at a time, and all occurrences of $P$ within the current window must be returned. The streaming sliding window string indexing problem naturally captures scenarios where we want to index the most recent data (i.e. the window) of a stream while supporting efficient pattern matching. Our main result is a simple $O(w)$ space data structure that uses $O(\log w)$ time with high probability to process each character from both the input string $S$ and the pattern string $P$. Reporting each occurrence from $P$ uses additional constant time per reported occurrence. Compared to previous work in similar scenarios this result is the first to achieve an efficient worst-case time per character from the input stream. We also consider a delayed variant of the problem, where a query may be answered at any point within the next $\delta$ characters that arrive from either stream. We present an $O(w + \delta)$ space data structure for this problem that improves the above time bounds to $O(\log(w/\delta))$. In particular, for a delay of $\delta = \epsilon w$ we obtain an $O(w)$ space data structure with constant time processing per character. The key idea to achieve our result is a novel and simple hierarchical structure of suffix trees of independent interest, inspired by the classic log-structured merge trees.
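The problem interface can be pinned down with a naive baseline, a sketch only: it keeps the last $w$ characters and answers a query by scanning the window in $O(w)$ time, whereas the paper's structure processes each character in $O(\log w)$ time.

```python
from collections import deque

class SlidingWindowIndex:
    """Naive baseline for the streaming sliding-window problem:
    keep the last w characters and answer queries by scanning."""

    def __init__(self, w: int):
        self.w = w
        self.window = deque(maxlen=w)  # old characters fall off the left

    def feed(self, c: str):
        """Consume one character of the stream S."""
        self.window.append(c)

    def query(self, p: str):
        """Report all occurrences of p inside the current window."""
        s = "".join(self.window)
        return [i for i in range(len(s) - len(p) + 1) if s.startswith(p, i)]
```

With `w = 5`, after feeding `"abcabcab"` the window holds `"abcab"`, and querying `"ab"` reports positions 0 and 3 relative to the window.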
Citations: 1
Parameterized Algorithms for String Matching to DAGs: Funnels and Beyond
Pub Date : 2022-12-15 DOI: 10.48550/arXiv.2212.07870
Manuel Cáceres
The problem of String Matching to Labeled Graphs (SMLG) asks to find all the paths in a labeled graph $G = (V, E)$ whose spellings match that of an input string $S \in \Sigma^m$. SMLG can be solved in quadratic $O(m|E|)$ time [Amir et al., JALG], which was proven to be optimal by a recent lower bound conditioned on SETH [Equi et al., ICALP 2019]. The lower bound states that no strongly subquadratic time algorithm exists, even if restricted to directed acyclic graphs (DAGs). In this work we present the first parameterized algorithms for SMLG in DAGs. Our parameters capture the topological structure of $G$. All our results are derived from a generalization of the Knuth-Morris-Pratt algorithm [Park and Kim, CPM 1995] optimized to work in time proportional to the number of prefix-incomparable matches. To obtain the parameterization in the topological structure of $G$, we first study a special class of DAGs called funnels [Millani et al., JCO] and generalize them to $k$-funnels and the class $ST_k$. We present several novel characterizations and algorithmic contributions on both funnels and their generalizations.
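The building block being generalized is the classic Knuth-Morris-Pratt failure (prefix) function; the textbook version on plain strings is shown below for reference (the paper's contribution is its extension to DAGs, which this sketch does not attempt).

```python
def prefix_function(p: str):
    """pi[i] = length of the longest proper prefix of p[0:i+1]
    that is also a suffix of it (classic KMP failure links)."""
    pi = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = pi[k - 1]  # fall back along failure links
        if p[i] == p[k]:
            k += 1
        pi[i] = k
    return pi
```

For example, `prefix_function("ababaca")` yields `[0, 0, 1, 2, 3, 0, 1]`.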
Citations: 3
Merging Sorted Lists of Similar Strings
Pub Date : 2022-08-19 DOI: 10.48550/arXiv.2208.09351
E. Myers
Merging $T$ sorted, non-redundant lists containing $M$ elements into a single sorted, non-redundant result of size $N \geq M/T$ is a classic problem typically solved practically in $O(M \log T)$ time with a priority-queue data structure, the most basic of which is the simple *heap*. We revisit this problem in the situation where the list elements are *strings* and the lists contain many *identical or nearly identical elements*. By keeping simple auxiliary information with each heap node, we devise an $O(M \log T+S)$ worst-case method that performs no more character comparisons than the sum of the lengths of all the strings $S$, and another $O(M \log (T/\bar{e})+S)$ method that becomes progressively more efficient as a function of the fraction of equal elements $\bar{e} = M/N$ between input lists, reaching linear time when the lists are all identical. The methods perform favorably in practice versus an alternate formulation based on a trie.
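The baseline the paper improves on can be sketched with a size-$T$ heap merge plus deduplication; note this naive version compares whole strings at each heap step, which is exactly the character-comparison cost the paper's auxiliary per-node information avoids.

```python
import heapq

def merge_unique(lists):
    """Merge T sorted lists into one sorted, duplicate-free list
    using a size-T heap: O(M log T) element comparisons overall."""
    merged = []
    last = None
    for x in heapq.merge(*lists):  # streams elements in sorted order
        if x != last:              # drop duplicates across lists
            merged.append(x)
            last = x
    return merged
```

For example, merging `["a","b","d"]`, `["a","c","d"]`, and `["b","d"]` yields `["a","b","c","d"]`.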
Citations: 0
L-systems for Measuring Repetitiveness
Pub Date : 2022-06-03 DOI: 10.48550/arXiv.2206.01688
G. Navarro, Cristian Urbina
An L-system (for lossless compression) is a CPD0L-system extended with two parameters $d$ and $n$, which determine unambiguously a string $w = \tau(\varphi^d(s))[1:n]$, where $\varphi$ is the morphism of the system, $s$ is its axiom, and $\tau$ is its coding. The length of the shortest description of an L-system generating $w$ is known as $\ell$, and is arguably a relevant measure of repetitiveness that builds on the self-similarities that arise in the sequence. In this paper we deepen the study of the measure $\ell$ and its relation with $\delta$, a better-established lower bound that builds on substring complexity. Our results show that $\ell$ and $\delta$ are largely orthogonal, in the sense that one can be much larger than the other depending on the case. This suggests that both sources of repetitiveness are mostly unrelated. We also show that the recently introduced NU-systems, which combine the capabilities of L-systems with bidirectional macro-schemes, can be asymptotically strictly smaller than both mechanisms, which makes the size $\nu$ of the smallest NU-system the unique smallest reachable repetitiveness measure to date.
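The expansion $w = \tau(\varphi^d(s))[1:n]$ is mechanical enough to sketch directly; the dictionary encoding of $\varphi$ and $\tau$ below is an illustrative choice, not the paper's description format.

```python
def l_system(morphism, coding, axiom, d, n):
    """Expand a CPD0L-system: apply the morphism d times to the
    axiom, apply the coding, and keep the length-n prefix,
    i.e. w = tau(varphi^d(s))[1:n]."""
    w = axiom
    for _ in range(d):
        w = "".join(morphism[c] for c in w)
    return "".join(coding.get(c, c) for c in w)[:n]
```

For example, the Thue-Morse morphism `a -> ab`, `b -> ba` with identity coding gives `l_system({"a": "ab", "b": "ba"}, {}, "a", 3, 8) == "abbabaab"`, a highly self-similar sequence of the kind $\ell$ captures.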
Citations: 0
Efficient Construction of the BWT for Repetitive Text Using String Compression
Pub Date : 2022-04-12 DOI: 10.48550/arXiv.2204.05969
Diego Díaz-Domínguez, G. Navarro
We present a new semi-external algorithm that builds the Burrows--Wheeler transform variant of Bauer et al. (a.k.a., BCR BWT) in linear expected time. Our method uses compression techniques to reduce computational costs when the input is massive and repetitive. Concretely, we build on induced suffix sorting (ISS) and resort to run-length and grammar compression to maintain our intermediate results in compact form. Our compression format not only saves space but also speeds up the required computations. Our experiments show important space and computation time savings when the text is repetitive. In moderate-size collections of real human genome assemblies (14.2 GB - 75.05 GB), our memory peak is, on average, 1.7x smaller than the peak of the state-of-the-art BCR BWT construction algorithm (\texttt{ropebwt2}), while running 5x faster. Our current implementation was also able to compute the BCR BWT of 400 real human genome assemblies (1.2 TB) in 41.21 hours using 118.83 GB of working memory (around 10% of the input size). Interestingly, the results we report in the 1.2 TB file are dominated by the difficulties of scanning huge files under memory constraints (specifically, I/O operations).
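For readers unfamiliar with the transform itself, the textbook single-string BWT can be sketched via sorted rotations; this quadratic-space definition is what the paper's semi-external, compressed construction replaces at scale (and the BCR variant for string collections differs in detail).

```python
def bwt(text: str) -> str:
    """Textbook BWT: sort all rotations of text + sentinel and
    concatenate their last characters. Quadratic space; for
    exposition only."""
    s = text + "$"  # unique, lexicographically smallest sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)
```

For example, `bwt("banana")` returns `"annb$aa"`, where equal letters cluster into runs, the property that run-length compression of intermediate results exploits.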
Citations: 5
Reduction ratio of the IS-algorithm: worst and random cases
Pub Date : 2022-04-09 DOI: 10.48550/arXiv.2204.04422
Vincent Jugé
We study the IS-algorithm, a well-known linear-time algorithm for computing the suffix array of a word. This algorithm relies on transforming the input word w into another word, called the reduced word of w, which is at most half as long; then, the algorithm recursively computes the suffix array of the reduced word. In this article, we study the reduction ratio of the IS-algorithm, i.e., the ratio between the lengths of the input word and the word obtained after reducing the input word k times. We investigate both worst cases, for which we find precise results, and random cases, where we prove some strong convergence phenomena. Finally, we prove that, if the input word is a randomly chosen word of length n, we should not expect much more than log(log(n)) recursive function calls.
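The object being computed can be pinned down with a naive definition; this sketch sorts suffixes by comparison in $O(n^2 \log n)$ time and involves no reduced word, whereas the IS-algorithm reaches linear time precisely by recursing on the reduction studied in the paper.

```python
def suffix_array(w: str):
    """Naive suffix array: the starting positions of all suffixes
    of w, sorted in lexicographic order of the suffixes."""
    return sorted(range(len(w)), key=lambda i: w[i:])
```

For example, `suffix_array("banana")` returns `[5, 3, 1, 0, 4, 2]`, corresponding to the sorted suffixes `a`, `ana`, `anana`, `banana`, `na`, `nana`.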
引用次数: 0
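The halving that drives the recursion can be made concrete. In SA-IS-style presentations, each position is classified as S-type or L-type, the reduced word gets one symbol per LMS position, and two LMS positions are never adjacent, so the reduced word is at most about half as long as the input. A minimal sketch of that classification (the function name and explicit sentinel convention are ours, following common SA-IS expositions rather than the paper's exact formalism):

```python
def lms_positions(w: str) -> list[int]:
    # Positions 0..n-1 hold w; position n is a virtual sentinel assumed
    # smaller than every character. A suffix is S-type if it is smaller
    # than the suffix starting one position to its right, L-type otherwise.
    n = len(w)
    s_type = [False] * (n + 1)
    s_type[n] = True  # the sentinel suffix is S-type by convention
    for i in range(n - 1, -1, -1):
        if i == n - 1:
            s_type[i] = False  # w[n-1] > sentinel, hence L-type
        else:
            s_type[i] = w[i] < w[i + 1] or (w[i] == w[i + 1] and s_type[i + 1])
    # An LMS position is an S-type position whose left neighbour is L-type.
    return [i for i in range(1, n + 1) if s_type[i] and not s_type[i - 1]]
```

On "banana" this returns [1, 3, 6]: one symbol of the reduced word per LMS position, hence at most (n + 1) // 2 symbols and an O(log n) worst-case recursion depth, which the paper sharpens to roughly log(log n) recursive calls in the random case.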
A theoretical and experimental analysis of BWT variants for string collections
Pub Date : 2022-02-26 DOI: 10.4230/LIPIcs.CPM.2022.25
David Cenzato, Zsuzsanna Lipták
The extended Burrows-Wheeler-Transform (eBWT), introduced by Mantaci et al. [Theor. Comput. Sci., 2007], is a generalization of the Burrows-Wheeler-Transform (BWT) to multisets of strings. While the original BWT is based on the lexicographic order, the eBWT uses the omega-order, which differs from the lexicographic order in important ways. A number of tools are available that compute the BWT of string collections; however, the data structures they generate in most cases differ from the one originally defined, as well as from each other. In this paper, we review the differences between these BWT variants, both from a theoretical and from a practical point of view, comparing them on several real-life datasets with different characteristics. We find that the differences can be extensive, depending on the dataset characteristics, and are largest on collections of many highly similar short sequences. The widely-used parameter $r$, the number of runs of the BWT, also shows notable variation between the different BWT variants; on our datasets, it varied by a multiplicative factor of up to $4.2$.
Citations: 1
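The quantities the abstract compares are easy to reproduce at toy scale. Below is a minimal sketch (ours, not one of the tools benchmarked in the paper) of a multidollar-style BWT variant for a collection — every string gets the same end-of-string sentinel and all rotations of all members are sorted together — plus the run count r:

```python
def bwt_collection(strings, sep="$"):
    # Append a sentinel (assumed smaller than any character and absent
    # from the strings) to each member, collect all cyclic rotations of
    # all members, sort them, and read off the last column.
    rotations = []
    for s in strings:
        t = s + sep
        rotations.extend(t[i:] + t[:i] for i in range(len(t)))
    rotations.sort()
    return "".join(rot[-1] for rot in rotations)

def runs(bwt):
    # r = number of maximal runs of equal characters in the BWT.
    return sum(1 for i, c in enumerate(bwt) if i == 0 or c != bwt[i - 1])
```

For a single string this is the textbook BWT: bwt_collection(["banana"]) gives "annb$aa". Because the sentinels are identical, the relative order of equal rotations coming from equal strings is left to the sort — exactly the kind of construction choice whose effect on r the paper measures, notably on collections of many highly similar short sequences.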
Journal
Annual Symposium on Combinatorial Pattern Matching